Psychometrika 


VOLUME XII —1947 
JANUARY-DECEMBER 





Editorial Council 


Chairman: L. L. THURSTONE
Editors: A. K. KURTZ, M. W. RICHARDSON
Managing Editor: HAROLD GULLIKSEN
Assistant Managing Editor: DOROTHY C. ADKINS


Editorial Board 
H. S. CONRAD K. J. HOLZINGER M. W. RICHARDSON 


ELMER A. CULLER PAUL HORST P. J. RULON

E. E. CURETON ALSTON S. HOUSEHOLDER WM. STEPHENSON 
JACK W. DUNLAP TRUMAN L. KELLEY S. A. STOUFFER 
MAX D. ENGELHART ALBERT K. KURTZ GODFREY THOMSON 
HENRY E. GARRETT IRVING LORGE L. L. THURSTONE 
J. P. GUILFORD QUINN MCNEMAR LEDYARD TUCKER 
HAROLD GULLIKSEN CHARLES I. MOSIER S. S. WILKS 
CHARLES M. HARSH HERBERT WOODROW 








PUBLISHED QUARTERLY 


By THE PSYCHOMETRIC SOCIETY 
AT 23 WEST COLORADO AVENUE 
COLORADO SPRINGS, COLORADO 






















CONTENTS

TEST “RELIABILITY”: ITS MEANING AND DETERMINATION
    LEE J. CRONBACH

TABLE FOR DETERMINING PHI COEFFICIENTS
    C. E. JURGENSEN

THE USE OF PSYCHOLOGICAL TECHNIQUES IN MEASURING AND CRITICALLY ANALYZING NAVIGATORS’ FLIGHT PERFORMANCE
    LAUNOR F. CARTER AND FRANK J. DUDEK

ANALYSIS IN TERMS OF FREQUENCIES OF DIFFERENCES
    HAROLD A. VOSS

AN INDEX OF ITEM VALIDITY PROVIDING A CORRECTION FOR CHANCE SUCCESS
    A. P. JOHNSON

HARALD CRAMÉR. Mathematical Methods of Statistics. A Review
    ALFRED L. BALDWIN

FREDERICK B. DAVIS. Item-Analysis Data: Their Computation, Interpretation, and Use in Test Construction. A Review
    ROBERT L. THORNDIKE








VOLUME TWELVE MARCH 1947 NUMBER ONE 








PSYCHOMETRIKA, VOL. 12, NO. 1
MARCH, 1947 


TEST “RELIABILITY”: ITS MEANING AND DETERMINATION 


LEE J. CRONBACH 
UNIVERSITY OF CHICAGO 


The concept of test reliability is examined in terms of general,
group, and specific factors among the items, and the stability of 
scores in these factors from trial to trial. Four essentially different 
definitions of reliability are distinguished, which may be called the 
hypothetical self-correlation, the coefficient of equivalence, the co- 
efficient of stability, and the coefficient of stability and equivalence. 
The possibility of estimating each of these coefficients is discussed. 
The coefficients are not interchangeable and have different values in 
corrections for attenuation, standard errors of measurement, and
other practical applications. 


The literature of testing contains many discussions of test reliability.
Each year, new formulations are offered, and new procedures
for estimating reliability are championed. There appears to
have developed no universally accepted procedure, and several writ- 
ers have attributed this difficulty to the diversity of definitions for 
reliability now in use. It has often been suggested that perhaps the 
only effective way to resolve the conflicts among contending view- 
points is to replace the term “reliability,” recognizing that it covers 
not one, but several concepts. The present paper attempts to restate 
the conflicting concepts and assumptions now current, and to offer a 
scheme for separating the various aspects of dependability of meas- 
urement. 

The physical scientist generally has expressed the accuracy of
his observations in terms of the variation of repeated observations 
of the same event. The mean of the squared deviations of these ob- 
servations about the obtained mean is the “error variance.” This is 
a measure of precision or reliability. If for the present we regard 
reliability as the consistency of repeated measurements of the same 
event by the same process, two fundamental differences between the
problem of the physical scientist and the psychologist appear. The 
physical scientist makes two assumptions, both of which are ade- 
quately true for him. First, he assumes that the entity being meas- 
ured does not change during the measurement process. By control- 
ling the relevant conditions—and he usually knows what these condi- 
tions are and can control them—he can hold nearly constant the 
length of a rod or the pressure of a gas. When measuring a variable
quantity, where his assumption is no longer valid, he abandons the
method of successive observations and employs instead simultaneous 
observations. The psychologist cannot obtain simultaneous measure- 
ments of behavior, yet the quantities that interest him are always 
variable and the method of successive measurements requires an im- 
possible assumption. The psychologist may wish to measure a hypo- 
thetical constant (aptitude, or a limen), but all he can ever observe 
is behavior, which is always shifting. It is one thing to test the ac- 
curacy of measurement of a quantity, quite another to test whether 
that quantity is constant. Judgment on the second question must 
await judgment on the first. 

The second assumption of the physical scientist is that his meas- 
urements are independent. If one rules out his remembering prior 
measurements, this assumption can usually be made true. Successive 
measurements of psychological quantities are rarely independent, 
however, because the act of measurement may change the quantity. 
London (10) has recently described this difficulty by the physicist’s 
term hysteresis. 

The reliability of a test score has generally been defined in terms 
of the variation of scores obtained by the individual on successive 
independent testings. Neither the assumption of constancy of true 
scores nor the assumption of experimental independence is realized 
in practice with most psychological variables; therefore, the reliabil- 
ity of a test, as so defined, is a concept which cannot be directly ob- 
served. If there is no standard of truth, it is fruitless to compare one 
estimate with another and debate which is more correct. But by vari- 
ous assumptions which usually cannot be tested, we obtain usable sta- 
tistics which describe the test. Different assumptions lead to differ- 
ent types of coefficients, which are not estimates of each other. In 
particular, as many writers have noted, an estimate of the stability 
of a test score is not at all the same as an estimate of the accuracy 
of measurement of behavior at any one instant. Jenkins cites Fran- 
zen’s comments on certain physiological measures which have high 
split-half “reliabilities” and low retest “reliabilities” (6). The meas- 
uring technique may be extremely accurate in reporting a biological 
instant in the life of an individual but not measure a stable character- 
istic of the individual. 

Both the physicist and the psychologist encounter the problem 
of observer error. In sighting through a telescope or scoring an es- 
say test, there is likely to be appreciable constant and variable error 
in observing. If one compares several judgments by the same ob- 
server, he includes the variable errors of observation with the errors 
of measurement. Hence, he studies the reliability of “this measuring
instrument used by this man.” If scores obtained by several observers
in simultaneous measurements are pooled for comparison, the
constant error of each man is included as a source of variation. This
procedure studies the reliability of “this measuring instrument used 
by different men.” Since the human takes part in the measurement, 
one cannot study the reliability of an instrument apart from the men 
who use it. 


Types of “Reliability” 
It is known that

$$ r_{tt} = 1 - \frac{\sigma_e^2}{\sigma_t^2}, \tag{1} $$

where $r_{tt}$ is the reliability coefficient, $\sigma_e^2$ is the hypothetical error
variance (the mean of the squared deviations of all obtained scores
for each person from the mean obtained score for that person), and
$\sigma_t^2$ is the variance of the scores of all persons on all the hypothetical
independent trials.
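
Equation (1) can be illustrated numerically. The following is a minimal sketch, assuming a small invented table of scores in which each person is measured on two hypothetical independent trials; the function name `reliability_from_trials` and the data are illustrative assumptions and do not come from the paper.

```python
from statistics import fmean

def reliability_from_trials(scores):
    """scores[p][k] is the obtained score of person p on hypothetical trial k."""
    all_scores = [x for person in scores for x in person]
    grand_mean = fmean(all_scores)
    # sigma_t^2: variance of all scores on all trials about the grand mean
    var_total = fmean([(x - grand_mean) ** 2 for x in all_scores])
    # sigma_e^2: mean squared deviation of each score from that person's own mean
    squared_devs = []
    for person in scores:
        person_mean = fmean(person)
        squared_devs.extend((x - person_mean) ** 2 for x in person)
    var_error = fmean(squared_devs)
    return 1.0 - var_error / var_total

# Invented data: three persons, two trials each.
scores = [
    [10, 12],
    [20, 18],
    [30, 32],
]
r_tt = reliability_from_trials(scores)
print(round(r_tt, 3))
```

Because each person's two scores differ only slightly while persons differ widely from one another, the error variance is small relative to the total variance and the coefficient is close to 1.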

It is convenient to consider the possible definitions of error of 
measurement in terms of variance. Using a bi-factor pattern to de- 
scribe a test,* the variance of scores from a single testing may be 
expressed as follows: 


$$ \sigma_t^2 = \sigma_g^2 + \sigma_{f_1}^2 + \sigma_{f_2}^2 + \cdots + \sigma_{s_1}^2 + \sigma_{s_2}^2 + \cdots + \sigma_{s_n}^2 + \sigma_e^2. \tag{2} $$

The terms have the following meanings:

$\sigma_t^2$ is the variance of obtained scores;

$\sigma_g^2$ is the variance in the general factor (if any) represented in
the test items;

$\sigma_{f_1}^2$, $\sigma_{f_2}^2$, etc., are the respective variances in the orthogonal group
factors of undetermined number, each of which is represented in two
or more items;

$\sigma_{s_1}^2$, $\sigma_{s_2}^2$, etc., are the respective “specificities” of the n items,
i.e., the part of the reliable variance of scores on the items which cannot
be assigned to common factors; and

$\sigma_e^2$ is the residual variance.


The referents for these factors may be illustrated in a hypothetical
examination in psychology. The general factor might include
general knowledge of psychology, reading ability, motivation, and
other characteristics. Group factors might be related to knowledge
of separate topics, mathematical skill required in only a few items,
and so on. Each item taps, in addition, some specific knowledge not
demanded by other items. The specificity variance accounts for individual
differences in these elements. The remaining variance may
include momentary inattention, guessing, and other random elements.

* Another factor pattern could be assumed without changing the basic argument
(4, 7-9, 107).

For reference, the formula will be rewritten thus: 


$$ \sigma_t^2 = \sigma_g^2 + \sum \sigma_f^2 + \sum \sigma_s^2 + \sigma_e^2. \tag{3} $$


Consider now the scores obtained from a series of independent 
measurements of the same individuals using the same test. 


$$ \sigma_t^2 = \sigma_{g_\infty}^2 + \sum_f \sigma_{f_\infty}^2 + \sum_s \sigma_{s_\infty}^2 + \sum_i \sigma_{g_i}^2 + \sum_i \sum_f \sigma_{f_i}^2 + \sum_i \sum_s \sigma_{s_i}^2 + \sigma_e^2. \tag{4} $$

$\sigma_t^2$ is the variance of all obtained scores about the grand mean;

$\sigma_{g_\infty}^2$ is the variance of the mean general factor scores of all individuals
about the mean for all individuals, i.e., the between-persons variance in g;

$\sigma_{f_\infty}^2$ is the between-persons variance in a group factor;

$\sigma_{s_\infty}^2$ is the between-persons variance in specificity on any item;

$\sum_i \sigma_{g_i}^2$ is the sum over individuals of the variances of the general-factor
scores for each individual about the mean for that individual, i.e.,
the within-persons variance;

$\sum_i \sum_f \sigma_{f_i}^2$ and $\sum_i \sum_s \sigma_{s_i}^2$ represent the corresponding within-persons
variances in the group factors and specificities, respectively; and

$\sigma_e^2$ is the residual variance.
The between-persons variances represent, as in the case of the 
single trial, individual differences in the factors. The within-persons 
variances represent instability of scores for each individual, as a re- 
sult of changes from test to test. 

These formulations permit an exact statement of what a “reli- 
ability coefficient” represents. Apparently at least four fundamen- 
tally different meanings of reliability are current: 


(1) The “error variance” may be permitted to include, in equation
(4), the terms $\sum_i \sigma_{g_i}^2$, $\sum_i \sum_f \sigma_{f_i}^2$, $\sum_i \sum_s \sigma_{s_i}^2$, and $\sigma_e^2$. That is, instability
is regarded as an error of measurement. This is the coefficient
defined by the correlation from repeated independent administrations
of the same test. The assumption of constancy is made, since any
change of score from trial to trial is treated as an error of measure- 
ment. If that assumption is true, the instability terms vanish, but 
such constancy in all the behaviors a test measures is highly unlikely. 


(2) The “error variance” may be permitted to include, in equation
(4), the terms $\sum_s \sigma_{s_\infty}^2$, $\sum_i \sigma_{g_i}^2$, $\sum_i \sum_f \sigma_{f_i}^2$, $\sum_i \sum_s \sigma_{s_i}^2$, and $\sigma_e^2$. Both
instability and specificity are treated as errors. This is the “reliability”
defined by the correlation between successive independent administrations
of equivalent tests. Because different items are used in
preparing equivalent forms, the specific-factor scores of individuals
on the two tests will be uncorrelated. These, therefore, contribute to
changes in score and are treated as error. If the tests do not represent
the same group factors, at least part of $\sum_f \sigma_{f_\infty}^2$ is also added to the
error variance.


(3) The “error variance” may be permitted to include, in equation
(3), the terms $\sum \sigma_s^2$ and $\sigma_e^2$. This defines “reliability” as the correlation
between two equivalent tests administered simultaneously.
Instability is excluded from consideration, and no assumptions of con- 
stancy are made. Specific-factor variances are included in errors of 
measurement. Depending on the degree of equivalence, part of the 
group-factor variance may also be treated as error. 


(4) The “error variance” may be restricted, in equation (3),
to the term $\sigma_e^2$. This is “reliability” defined as the self-correlation of
a test (see below). No assumption of constancy is made, and inde- 
pendence is not involved. The specific factors remain the same from 
test to test and are added to the true-score variance. All real vari- 
ables measured by the test are treated as quantities estimated, not as 
errors. 


It may now be helpful to restate these definitions and to give 
them names for reference. 


Definition (1): Reliability is the degree to which the test 
score indicates unchanging* individual differences in any 
traits. (Coefficient of stability).


Definition (2): Reliability is the degree to which the test 
score indicates unchanging individual differences in the gen- 
eral and group factors defined by the test. (Coefficient of 
stability and equivalence). 


* This may be modified by requiring constancy over some specified period 
(one year, one day, etc.) 










Definition (3): Reliability is the degree to which the test 
score indicates the status of the individual at the present in- 
stant in the general and group factors defined by the test. 
(Coefficient of equivalence). Internal consistency tests are 
generally measures of equivalence. These coefficients pre- 
dict the correlation of the test with a hypothetical equiva- 
lent test, as like the first test as the parts of the first test 
are like each other. 


Definition (4): Reliability is the degree to which the test 
score indicates individual differences in any traits at the 
present moment. (Hypothetical self-correlation). 


These names are open to criticism, and better suggestions are in or- 
der. The important thing is to recognize that in the past all four of 
these and many approximations to them have been called “the reli- 
ability coefficient.” No one of these is the “right” coefficient. They 
measure different things, and each is useful. What is important is 
to avoid confusing one with another, and using one as an estimate of 
another. It may be noted that reliability of a test can only be dis- 
cussed in relation to a particular sample of persons. 

The components of error variance under each definition imply 
that in practice some coefficients will be larger than others for a giv- 
en test. If stability is not perfect, and if items contain some spe- 
cificity loading, the hypothetical self-correlation will be greatest, and 
the coefficient of stability and equivalence will be the smallest of the 
four. 
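
The ordering just stated can be checked with simple arithmetic. The sketch below uses invented variance components laid out in the order of equation (4), and computes each coefficient as one minus the ratio of its error variance to the total variance. The numbers, the dictionary keys, and the treatment of single-trial specificity as the sum of its between-persons and within-persons parts are assumptions made for illustration only.

```python
# Hypothetical variance components, in the order of equation (4).
# All numbers are invented for illustration.
components = {
    "general_between": 40.0,   # between-persons variance, general factor
    "group_between": 20.0,     # between-persons variance, group factors
    "specific_between": 12.0,  # between-persons variance, specific factors
    "general_within": 6.0,     # instability, general factor
    "group_within": 4.0,       # instability, group factors
    "specific_within": 3.0,    # instability, specific factors
    "residual": 8.0,           # residual variance
}
total = sum(components.values())

instability = (components["general_within"]
               + components["group_within"]
               + components["specific_within"])
# Assumption: the single-trial specificity variance of equation (3) is taken
# as the between-persons plus the within-persons specific variance.
specificity_single_trial = (components["specific_between"]
                            + components["specific_within"])

# Error variance under each of the four definitions.
error = {
    "stability": instability + components["residual"],
    "stability_and_equivalence": (components["specific_between"]
                                  + instability + components["residual"]),
    "equivalence": specificity_single_trial + components["residual"],
    "self_correlation": components["residual"],
}
coefficients = {name: 1.0 - e / total for name, e in error.items()}

# Print from smallest to largest coefficient.
for name, value in sorted(coefficients.items(), key=lambda kv: kv[1]):
    print(f"{name:28s} {value:.3f}")
```

With any positive instability and specificity, the self-correlation comes out largest and the coefficient of stability and equivalence smallest; the relative order of the other two depends on the particular components assumed.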

As Kelley states (7), the concept of reliability is meaningless un- 
less one postulates that two measures of the same function exist. 
They may be successive measurements of a stable event, or simul- 
taneous measurements of a unique event. But in regard to the non- 
repeating event which can be observed only once, reliability has only 
a theoretical interest. In fact, if one accepts a deterministic position, 
there is no “error” in a measurement of a unique event. The stu- 
dent’s responses and his score are determined by many forces, and 
we do not know what they are; but the resultant of these forces is a 
particular act, and the act itself, at this instant and with these par- 
ticular forces, is perfectly reliable. “Chance” and “error” are merely 
names we give to our ignorance of what determines an event. 

All methods of studying reliability make a somewhat fallacious 
division of variables into “real variables” and “error.” It is probably
more correct to conceive a continuum between the instantaneous
behavior which has an infinitesimal period, through states of longer 
duration, to the virtually constant individual differences. A test score
is made up of all these “real” elements, each of which could be per-
fectly predicted if our knowledge were adequate. Reliability, accord- 
ing to this conception, becomes a measure of our ignorance of the 
real factors underlying brief fluctuations of behavior and atypical 
acts. Perhaps a new statistical method based on the non-Aristotelian 
conception of a continuum of realities will some day permit us to 
avoid the troublesome attempt to divide the continuum into “reality” 
and “error.” 

For the present, it appears to be necessary to retain the artificial 
separation. In thinking about the self-correlation of a test—the con- 
sistency with which it measures whatever it measures—we may class 
as chance effects all variables whose period of variation is shorter 
than the time required to take the test. Momentary fluctuations are 
therefore “errors,” but shifts in fatigue, set, or skill having a longer 
cycle are possibly worth measuring. 


Techniques of Estimation 


Each method used in the past to study “reliability” may be associated
with one of these definitions. The procedures requiring more
than one trial will be discussed first. 


Retest method. The retest method calls for giving the same test 
twice to the same group. The trials are supposed to be independent, 
but this may well not be true. Shift in relative scores is always treat- 
ed in the error variance, not the true-score variance; the retest co- 
efficient is therefore an estimate of the coefficient of stability. Fail- 
ure to attain independent trials may make the estimate too high or 
too low. 

Guttman (3, 263), in a complete reconsideration of reliability 
theory, defines reliability in terms of the stability of individual dif- 
ferences during a large number of “independent” retests. He shows 
that the reliability thus defined (a coefficient of stability) may be 
estimated by the correlation between two independent trials. His def- 
inition of independence will be discussed below. 


Equivalent tests method. Two “equivalent” or “parallel” tests 
may be given, with any interval between, and their correlation de- 
termined. Experimental independence is assumed, despite the effect 
experience with one form may have on the second. Constancy is as- 
sumed, and all shifts in relative score are treated in the error vari- 
ance. Specific-factor variances are treated in the error variance. This 
is therefore an estimate of the coefficient of stability and equivalence. 
Because the assumption of independence cannot be tested, it is never
known whether the estimate is high or low. To interpret a coefficient
involving equivalence, one must know how the tests are equivalent. 
If the tests are alike only in the general factor, group-factor vari- 
ances are included as error, and the coefficient reflects the extent to 
which scores are determined by a stable general factor. Parallel tests 
should ordinarily have the same general and group factors. Were 
items in the two forms matched to test the same specific items of in- 
formation or skill, the equivalent tests might to some degree include 
the same specific factors. The specific factors in the two tests could 
not be completely the same, however, unless the items were identical. 
The coefficient of equivalence is a property of a pair of tests and will 
vary according to the kind of similarity established in equating the 
tests. To the degree that parallel tests have the same general and 
group factors, the coefficient indicates the stability of performance
in the general and group factors. 


The split-half method. The widely used split-half method requires 
the correlation of half the items in the test with the remaining items. 
Cronbach has studied the effect of various splits upon the resulting 
coefficient (1) and has suggested the use of parallel splits, in which 
the two halves are made nearly equivalent (2). In the parallel split, 
each part represents the general factor and the group factors of the 
original test as well as possible. The half-tests should have equal 
standard deviations. The procedure makes no assumption of con- 
stancy, but does include the specific-factor variance as error variance. 
The split-half estimate is a coefficient of equivalence, estimating the 
correlation of simultaneously administered parallel tests, as like each 
other as are the halves of the test given. Any failure in splitting to 
obtain equivalent halves will tend to lower the correlation obtained. 
An assumption of experimental independence is made in considering 
the split-half correlation an estimate of the parallel-test correlation. 
In testing by parallel tests, the performance on one form is presumably
independent of performance on the other. When items are presented
together, however, there is always the possibility of spurious
inter-item correlation due to item linkages and brief fluctuations of 
mood and attention. 

Most random or odd-even splits do not represent all factors 
equally in both halves. If the assumption of experimental indepen- 
dence were valid, the correlation would therefore be an underestimate 
of the coefficient of equivalence. Guttman (3, 260) states that the 
corrected split-half coefficient is always a lower bound to “the reli- 
ability coefficient,” no matter how the test is split. He cautions that 
this inequality is true only for an indefinitely large sample of per- 
sons. Sampling errors in practice preclude taking as one’s coefficient
the largest of many trial split coefficients. Guttman defines reliabil-
ity in terms of repeated independent trials of the same (not equiva- 
lent) tests. By this definition, the split-half estimate, including spe- 
cificity as an error of measurement, is a low one. The coefficient of 
equivalence is a conservative estimate of the hypothetical self-corre- 
lation. 

The assumptions of the Spearman-Brown formula have been 
stated in various ways, and this has led to some confusion as to the 
applicability of the formula. The derivation hypothecates equivalent 
tests and predicts their correlation from the correlation of equivalent 
half-tests. Equivalence is the only assumption made, and in the derivation
equivalence is defined by requiring equal standard deviations
of the half-tests and by requiring that the hypothetical equivalent
tests be just as similar as the half-tests ($r_{ab} = r_{aA} = r_{bB} = r_{aB}$).
This defines equivalence so that all tests have the same common factor 
composition. It makes no direct assumption of the equivalence of pairs 
of items or of the unit-rank among the item intercorrelations. 
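
The step from these assumptions to the Spearman-Brown formula can be written out. The sketch below uses notation assumed for illustration: a and b are the half-tests, A and B are the halves of the hypothetical equivalent test, $\sigma$ is the common standard deviation of all four half-tests, and $r$ is the common correlation between any two of them.

```latex
\begin{aligned}
r_{(a+b)(A+B)}
  &= \frac{\operatorname{cov}(a,A) + \operatorname{cov}(a,B)
         + \operatorname{cov}(b,A) + \operatorname{cov}(b,B)}
          {\sigma_{a+b}\,\sigma_{A+B}} \\
  &= \frac{4 r \sigma^2}{2\sigma^2 + 2 r \sigma^2}
   = \frac{2r}{1 + r}.
\end{aligned}
```

Each of the four covariances in the numerator equals $r\sigma^2$, and each full test has variance $2\sigma^2 + 2r\sigma^2$, so nothing beyond equal standard deviations and equal correlations is needed.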

The items of a test may be considered as a sample of some larger 
population. One may define the purpose of the test in terms of the 
population of items to be measured; the test fulfils this purpose inso- 
far as the items are a representative sample of the population. Alter- 
natively, one may consider the test as defined by its items, and think 
of the population as the entire group of items of which the sample 
is representative. The coefficient of equivalence (obtained by the 
parallel-test or internal consistency methods) correlates two samples 
of items and indicates the extent to which the variance in each may 
be attributed to common factors. The extent of common-factor loadings
is the extent to which test scores are determined by “the population
variable.” If the samples to be compared must be representative,
rather than random, it is necessary, in split-half procedures, to use 
the parallel split or a split according to a table of specifications. 


The Kuder-Richardson formulas. A radical reformulation of the 
reliability problem was offered in 1937 by Kuder and Richardson (8). 
They proposed several alternative formulas which have been widely 
adopted. The original derivation has been criticized because of the 
numerous assumptions made, but other writers have developed the 
same formulas more directly. Perhaps the simplest derivation was 
published by Jackson and Ferguson (5, 74). They define reliability 
as a coefficient of equivalence, equivalence being defined by requiring 
that the two tests have equal variances and that the mean inter-item 
covariance within each test be equal and equal to the mean inter-item 
covariance between tests. If these assumptions are satisfied, the 
Kuder-Richardson formula (20) is an exact estimate of the coefficient
of equivalence. This condition is a reasonable one when the items of
a test are considered as drawn from a population of items all measuring
a single general factor. If group factors are present, even
though the two tests measure these group factors equally, then
$\overline{r_{ij} s_i s_j} < \overline{r_{ij'} s_i s_{j'}}$,* and the Kuder-Richardson formula gives a conservative
estimate of the coefficient of equivalence; how conservative
one does not know.
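
As a concrete illustration, Kuder-Richardson formula (20) in its usual form, $\frac{n}{n-1}\left(1 - \frac{\sum pq}{s_t^2}\right)$, can be computed from a persons-by-items table of right/wrong scores. The following is a minimal sketch; the function name `kr20` and the response table are invented for illustration, and variances are computed with divisor N so that they match the item variances pq.

```python
def kr20(responses):
    """Kuder-Richardson formula (20) for a persons-by-items table of 0/1 scores."""
    n_persons = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_persons
    # Variance of total scores, dividing by N to match the p*q item variances.
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_persons
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_persons  # proportion passing item j
        sum_pq += p * (1.0 - p)
    return (n_items / (n_items - 1)) * (1.0 - sum_pq / var_total)

# Invented data: five persons, four items of increasing difficulty.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
r_kr20 = kr20(responses)
print(round(r_kr20, 3))  # 0.8
```

Here the item variances sum to 0.8 and the total-score variance is 2.0, so the coefficient is (4/3)(1 - 0.4) = 0.8.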


The Guttman lower bounds. The latest statement of the problem 
is that published by Guttman in 1945 (3). He derives six formulas 
for estimating a coefficient from data obtained on a single testing, 
all the estimates being lower than the “true reliability” if the sample 
is sufficiently great. His estimate $L_3$ is identical to that from Kuder-Richardson
formula (20), although the derivations are dissimilar.
His $L_4$ is equivalent to the split-half coefficient. $L_2$, which uses item
covariances, is an original formula more difficult to compute than $L_3$
and $L_4$. $L_1$, $L_5$, and $L_6$ are expected to have little practical importance.
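
The first four bounds can be computed directly from the item covariance matrix. The sketch below follows the usual statement of Guttman's bounds: $L_1 = 1 - \sum s_j^2 / s_t^2$, $L_2 = L_1 + \sqrt{\tfrac{n}{n-1} C_2}\,/\,s_t^2$ with $C_2$ the sum of squared off-diagonal covariances, $L_3 = \tfrac{n}{n-1} L_1$, and $L_4$ as the split-half form. The function names, the response table, and the particular split are assumptions chosen for illustration.

```python
from math import sqrt

def covariance_matrix(responses):
    """Item covariance matrix (divisor N) for a persons-by-items table."""
    n_persons = len(responses)
    n_items = len(responses[0])
    means = [sum(row[j] for row in responses) / n_persons for j in range(n_items)]
    return [[sum((row[j] - means[j]) * (row[k] - means[k]) for row in responses) / n_persons
             for k in range(n_items)]
            for j in range(n_items)]

def guttman_bounds(responses, half_a):
    """Return (L1, L2, L3, L4); half_a lists the item indices forming one half."""
    cov = covariance_matrix(responses)
    n = len(cov)
    var_total = sum(sum(row) for row in cov)          # variance of total score
    sum_item_var = sum(cov[j][j] for j in range(n))   # sum of item variances
    # Sum of squared covariances over all ordered pairs of distinct items.
    c2 = sum(cov[j][k] ** 2 for j in range(n) for k in range(n) if j != k)
    l1 = 1.0 - sum_item_var / var_total
    l2 = l1 + sqrt(n / (n - 1) * c2) / var_total
    l3 = n / (n - 1) * l1
    half_b = [j for j in range(n) if j not in half_a]
    var_a = sum(cov[j][k] for j in half_a for k in half_a)
    var_b = sum(cov[j][k] for j in half_b for k in half_b)
    l4 = 2.0 * (1.0 - (var_a + var_b) / var_total)
    return l1, l2, l3, l4

# Invented data: five persons, four items of increasing difficulty.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
l1, l2, l3, l4 = guttman_bounds(responses, half_a=[0, 3])
print(round(l1, 3), round(l2, 3), round(l3, 3), round(l4, 3))
```

For these data $L_3$ equals 0.8, the value Kuder-Richardson formula (20) gives for the same table, and $L_2$ is slightly larger, in line with the statement that $L_2$ is ordinarily the greater of the two.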


Guttman defines error as the variation of the score of a person 
over a universe of independent trials with the same test. His crucial 
assumption, C, (3, 265-266), defines independence so that the score 
of a person on any item on any trial is experimentally independent 
of his scores on any other items. In practice, changes in motivation, 
function shift, and other variables cause items administered together 
to vary together. Guttman classes shifts in the variables measured 
as errors of measurement and therefore is estimating a coefficient of 
stability when he demonstrates that the correlation between two in- 
dependent trials on a large population may be taken as equal to “the 
reliability coefficient” (3, 268). 

In deriving lower-bounds formulas, Guttman deals with hypo- 
thetical independent retests in which the mean covariance of two 
items within trials equals the mean covariance of the same items be- 
tween trials. Beyond this he makes no assumption. His definition of 
independence requires that there be no shift in the variables meas- 
ured between trials; i.e., that the hypothetical trials be simultaneous. 
Since he is using identical tests simultaneously, he has defined reli- 
ability as the hypothetical self-correlation. His formulas lead to un- 
derestimates of that coefficient. 

* I.e., the mean inter-item covariance within tests is less than the mean inter-item
covariance between tests.

One may study the effect on Guttman’s results if his assumption
of independence within trials is denied. This may occur when one
item influences the answer to another by giving a clue, by causing
encouragement or discouragement, or by setting up a pattern among
the responses. In the derivation of $L_1$, the assumption leads to discarding
a positive covariance term from the right member of (28).
As a consequence, $L_3$ and $L_4$ are greater than they would be without
the assumption, and may overestimate the hypothetical self-correlation
as defined. In the derivation of $L_2$, $L_5$, and $L_6$, the assumption
is felt in (25), where a positive covariance term is dropped from the
right member. Without the assumption,


rc y 
Yer 7; a 7X xX, » Gg a J> 


and the inequality given in (37) may not hold. The remainder of the 
derivation therefore may lead to estimates higher than the hypotheti- 
cal self-correlation, if the assumption of experimental independence 
of items does not hold. 

This weakness is common to all estimates of reliability based on 
a single trial. Lindquist (9, 219) points out that in the split-half 
method the two halves are falsely assumed to be experimentally inde- 
pendent, and therefore he considers the split-half estimate spuriously 
high. [He, however, defines reliability as what we have called the 
coefficient of stability and equivalence (9, 216)]. In the Kuder-Rich- 
ardson formula, as derived by Jackson and Ferguson, the same as- 
sumption of independence is made when the mean inter-item covari- 
ance between tests is taken as equal to the mean covariance within 
tests. If motivation, response sets, and other factors common to per- 
formance on the various items of a trial are considered part of the 
general or group factors measured by the test, their contribution to 
the inter-item correlation within a trial is rightly included in the 
estimate of accuracy of measurement. But momentary variations 
which cause random changes in item covariance should not be permitted
to raise the estimate obtained. Any estimate of self-correlation
or equivalence based on a single trial may be higher than the
hypothetical self-correlation. It may be treated as a conservative or 
exact estimate only if we are willing to assume that the response to 
each item is an independent behavior, related to response on other 
items only because of significant conditions in the person tested. 

Guttman makes the point that his split-half formula

$$ L_4 = 2\left(1 - \frac{s_a^2 + s_b^2}{s_t^2}\right) \tag{5} $$

is superior to the Spearman-Brown formula in that it does not assume
the two half-tests to have equal variance. His formula can be derived 
as an estimate of the coefficient of equivalence, according to the usual 
proof of the Spearman-Brown formula, except that equivalence is defined
so that $\sigma_{a+b} = \sigma_{A+B}$, and $r_{ab}\sigma_a\sigma_b = r_{AB}\sigma_A\sigma_B = r_{aA}\sigma_a\sigma_A = r_{bB}\sigma_b\sigma_B = r_{aB}\sigma_a\sigma_B$.
This leads to a formula identical to Guttman’s, or an equivalent
form previously derived by Flanagan (see Kelley, 7) which is
less readily computed. Values obtained using this formula are smaller
(usually by a small amount) than the values from the Spearman-Brown
formula, except where $s_a = s_b$. It appears that this formula
should replace the Spearman-Brown procedure.
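
The relation between the two split-half estimates can be seen on a small example. The sketch below computes the Spearman-Brown value, 2r/(1 + r), and the value from formula (5) for the same pair of half-test scores; the function name `split_half_estimates` and both data sets are invented for illustration.

```python
from math import sqrt

def variance(xs):
    """Variance with divisor N."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def split_half_estimates(half_a, half_b):
    """Return (Spearman-Brown estimate, formula (5) estimate) for two half-test scores."""
    totals = [a + b for a, b in zip(half_a, half_b)]
    var_a, var_b, var_t = variance(half_a), variance(half_b), variance(totals)
    # var(a + b) = var(a) + var(b) + 2 cov(a, b)
    cov_ab = (var_t - var_a - var_b) / 2.0
    r_ab = cov_ab / sqrt(var_a * var_b)
    spearman_brown = 2.0 * r_ab / (1.0 + r_ab)
    formula_5 = 2.0 * (1.0 - (var_a + var_b) / var_t)
    return spearman_brown, formula_5

# Halves with unequal variances: formula (5) gives the smaller value.
sb_unequal, f5_unequal = split_half_estimates([2, 1, 1, 1, 0], [2, 2, 1, 0, 0])

# Halves with equal variances: the two estimates coincide.
sb_equal, f5_equal = split_half_estimates([2, 1, 1, 1, 0], [2, 1, 1, 0, 1])

print(round(sb_unequal, 3), round(f5_unequal, 3))
print(round(sb_equal, 3), round(f5_equal, 3))
```

Formula (5) needs only the three variances, whereas the Spearman-Brown route requires the half-test correlation first.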


Summary 

Four possible definitions of “reliability” have been considered.
The hypothetical self-correlation requires independent simultaneous
identical tests. For psychological variables this is a hypothetical situation,
and no one has found an unbiased estimate of this coefficient.
Guttman’s formula $L_2$ would be a conservative estimate of the hypothetical
self-correlation, save for the necessity of assuming that responses
to one item are not influenced by responses to another item.
Guttman’s $L_2$ is ordinarily greater than the estimate from the Kuder-Richardson
formula.

The coefficient of equivalence is lower than the hypothetical self- 
correlation. Kuder-Richardson formula (20) is an exact estimate of 
the coefficient of equivalence for tests where the item intercorrelation 
matrix has rank one; otherwise the estimate is conservative. This, 
however, like all estimates of equivalence, assumes experimental in- 
dependence of items within one trial. The parallel-split method gives 
an estimate of the coefficient of equivalence. For an ideally large
population, the highest split-coefficient is the best estimate, and esti- 
mates from other splits are conservative, save for the failure of in- 
dependence of items. 

The coefficient of stability is lower than the hypothetical self- 
correlation. It is estimated by the test-retest correlation, but carry- 
over from one test to another may cause the estimate to be faulty. 

The parallel-tests correlation is an estimate of the coefficient of
stability and equivalence. It may be unduly high if the two tests are 
not experimentally independent. Otherwise, the estimate will ordi- 
narily be lower than the coefficient of stability or the coefficient of 
equivalence. 

A simple table may indicate the different meanings of the vari- 
ous procedures. In Table 1, checks indicate the variances which are 
included in the error of measurement, according to each procedure. 
In the absence of sampling error, any estimate of reliability is less 
than the hypothetical self-correlation, assuming experimental inde- 
pendence. Every procedure assumes either the experimental indepen- 
dence of trials or of items within the trials. This condition is rarely
satisfied, and any obtained coefficient may therefore be higher than 
the coefficient supposed to be obtained. 


TABLE 1 
Variances Included in Error Variance of a Test, According to 
Various Formulations of the Reliability Problem* 


[Column headings, naming the variance components, are not legible in this copy.]
Test-Retest x x x x 
Parallel Test x x x x x 
Parallel Split x x 
Random Split x x x 
Kuder-Richardson (20) x x x 
Guttman $L_2$ x†
Hypothetical Self-Correlation x 
Coefficient of Equivalence x x
Coefficient of Stability x x x x
Coefficient of Stability 
and Equivalence x x x x x 


* An x indicates that the variance indicated is included in the error of measurement by the
procedure or definition listed at the left.

† In equations (31) and (43), Guttman sets up inequalities which overestimate the item error
variance.


Practical Implications 

No one “best” estimate of reliability exists. If one could validly 
make the assumption of stability between trials, and independence of 
trials, the test-retest correlation would be satisfactory. Frequently 
we must rely on single-trial estimates. Guttman’s $L_2$ or a parallel-split
used with his $L_4$ will in general give the highest coefficients.
Where the test measures a single factor, the Kuder-Richardson formula
(Guttman’s $L_3$) should be as useful as the other two procedures.

In many situations, it is appropriate to seek a coefficient other 
than the hypothetical self-correlation. In correcting for attenuation, 
any of the coefficients described in this paper may be appropriate. 
Following the lead of Remmers and Whisler (11), one may distin- 
guish between the “true instantaneous score” in a variable (related 
to the self-correlation or the coefficient of equivalence) and the “true 
score” in a trait (related to the coefficient of stability or of stability 
and equivalence). Sometimes one wishes to know the correlation be- 
tween true scores in two traits postulated as stable over a period of 
time—“somatotype” vs. “temperament” is a typical problem. Here 
the appropriate coefficients for use in the attenuation formula are the
coefficient of stability (if the trait is defined operationally by a spe- 
cific test) or the coefficient of stability and equivalence (if the trait 
is defined by a family of similar tests). Other problems call for study- 
ing the relation between true instantaneous score in one variable 
(such as an aptitude test) and true score in another defined as stable 
(such as job performance). For this, the reliability of the former 
score would be based on a coefficient of equivalence (since the hypo-
thetical self-correlation is not known), and the reliability of the lat- 
ter would be based on one of the coefficients involving stability. The 
third possibility, and one of much theoretical importance, is a prob- 
lem regarding true instantaneous scores in two variables, such as 
mood and performance. The correction for attenuation here requires 
use of two coefficients of equivalence.
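
The three cases distinguished above differ only in which reliability coefficients enter the correction. A minimal sketch follows, assuming the usual attenuation formula (observed correlation divided by the square root of the product of the two reliabilities); the function name and the numerical values are hypothetical.

```python
import math


def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Estimated correlation between true scores.

    r_xy  : observed correlation between the two measures
    rel_x : reliability coefficient chosen for the first measure
    rel_y : reliability coefficient chosen for the second measure

    Which coefficient is appropriate (equivalence, stability, or
    stability and equivalence) depends on the question asked.
    """
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical case: aptitude test (coefficient of equivalence .81)
# against a criterion treated as stable (coefficient of stability .64).
print(correct_for_attenuation(0.42, 0.81, 0.64))
```

Substituting a different coefficient for either measure changes the corrected value, which is why the coefficients are not interchangeable.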

Similar reasoning applies to the problem of estimating the sig- 
nificance of changes in test score. If the identical test is given both 
times, the coefficient of stability is appropriate. The hypothetical self- 
correlation, if known, would test whether a significant change in be- 
havior had occurred, although this change might be due to normal 
diurnal fluctuation. The coefficient of stability tests whether the 
change is greater than that “normally” to be expected due to function 
fluctuation. If growth is measured by equivalent tests, a coefficient 
of equivalence, or of stability and equivalence, is relevant.

In evaluating a test, all four coefficients are of interest. For 
most purposes, one wishes to measure stable characteristics, so that 
a coefficient of stability is needed. For research purposes, however, 
a test having high instantaneous self-correlation or equivalence and 
low stability may be very satisfactory. 

The coefficient of stability is an abstraction; in reality, there is 
an indefinitely large number of such coefficients, corresponding to 
various time intervals between tests. For meaningful use of such a 
coefficient, it must be defined as “the coefficient of stability over one 
week,” or the like. The coefficient also depends on the conditions af-
fecting the subject between testings. Strictly speaking, a coefficient 
of stability may be carried over to a new situation only when the time 
interval and the conditions between testings are similar to those un- 
der which the coefficient was obtained. The coefficient of stability 
would be better understood if research were available showing how 
the coefficient varies with increasing time lapse. 

The following recommendations result from the analysis made 
above. 

1. Reliability for psychological measurement can never be ob- 
served as in the physical sciences, where variables are practically 
constant and non-hysteretic. All estimates of reliability require as-
sumptions unlikely to be fulfilled. 

2. Several coefficients numerically less than the hypothetical 
self-correlation can be estimated. A distinction between these vari- 
ous coefficients should be made; the writer proposes the names coef- 
ficient of equivalence, coefficient of stability, and coefficient of sta- 
bility and equivalence. 

3. The coefficient of equivalence may be estimated by the par- 
allel-split method, using formula (5), Guttman’s $L_4$. The Kuder-
Richardson formula (20) underestimates this coefficient unless the
test item matrix has rank one. Guttman’s $L_2$ gives an underestimate
of the hypothetical self-correlation which may or may not be higher 
than the coefficient of equivalence. All estimates of reliability or 
equivalence based on a single trial assume that test items are experi- 
mentally independent. To the extent that this is untrue, estimates 
may be erroneously high. 

4. The coefficient of stability may be estimated by the test-re- 
test method, with an undetermined error due to failure of indepen- 
dence. The coefficient of stability and equivalence may be estimated 
by the correlation of parallel tests, with a similar error. 

5. In describing a test, the author should provide separate es- 
timates of the coefficient of equivalence and the coefficient of stability. 
The time interval used in obtaining the coefficient of stability should 
be reported. If there are multiple forms, the coefficient of stability 
for each should be given. 

6. In practice, the coefficient of equivalence or the coefficient of 
stability may be used meaningfully where the reliability coefficient 
is called for. The coefficients are not interchangeable and have dif- 
ferent meanings in corrections for attenuation, standard errors of 
measurement, and like applications. The hypothetical self-correlation, 
showing the extent to which a test measures real but possibly mo- 
mentary differences in performance, is more important to the theory 
of measurement than to the practical use of tests. 


REFERENCES 

1. Cronbach, L. J. A case study of the split-half reliability coefficient. J. educ. 
Psychol., in press. 

2. Cronbach, L. J. On estimates of test reliability. J. educ. Psychol., 1943, 34,
485-494.

3. Guttman, L. A basis for analyzing test-retest reliability. Psychometrika, 
1945, 10, 255-282. 

4. Holzinger, K. J. and Harman, H. Factorial analysis. Chicago: University 
of Chicago Press, 1941. 

5. Jackson, R. W. B. and Ferguson, G. A. Studies on the reliability of tests. 
Toronto: Department of Educational Research, Bulletin No. 12, 1941. 

6. Jenkins, J. G. Validity for what? J. consulting psychol., 1946, 10, 93-98.

7. Kelley, T. L. The reliability coefficient. Psychometrika, 1942, 7, 75-83.

8. Kuder, G. F. and Richardson, M. W. The theory of the estimation of test
reliability. Psychometrika, 1937, 2, 151-160.

9. Lindquist, E. F. A first course in statistics. Boston: Houghton-Mifflin, 1942.

10. London, I. D. Some consequences for history and psychology of Langmuir’s
concept of convergence and divergence of phenomena. Psychol. Rev., 1946,
53, 170-188.

11. Remmers, H. H. and Whisler, L. Test reliability as a function of method
of computation. J. educ. Psychol., 1938, 29, 81-92.








PSYCHOMETRIKA—VOL. 12, NO. 1
MARCH, 1947


TABLE FOR DETERMINING PHI COEFFICIENTS 


C. E. JURGENSEN 
MINNEAPOLIS GAS LIGHT COMPANY 


A table is presented which directly gives phi coefficients accu-
rate to three places when entered by the proportion of one sub-group 
responding in a specified manner and the proportion of a second 
sub-group responding in the same manner. The table gives coeffici-
ents identical with those obtained by formula if the sub-groups are 
equal in number. The phi coefficients can readily be expressed, if 
desired, in terms of critical ratio or chi square. The table is more 
accurate than the use of abacs and eliminates the use of time-con- 
suming formulas. Accurate determination of item validity on the 
basis of statistically rigorous techniques can be made more quickly 
by means of the table than validity determined by less efficient meth- 
ods which have previously been used to save time. 


Increasing emphasis on item analyses is being found in test con- 
struction and development, particularly with regard to item validity 
as estimated by means of internal consistency or an outside criterion. 
Long and Sandiford (5) have summarized the methods which were 
most popular a decade ago, and test construction since that time
seems to have continued to follow those methods in the main. 

Methods for determining whether individual items should be re- 
tained or rejected (or the closely allied problem of determining what 
weights should be assigned each item) vary from a simple determina- 
tion of the difference in per cent of persons in contrasted groups who 
respond similarly to a given item, to the more rigorous correlational 
techniques or levels of significance. Because of the great amount of 
computational time necessary to determine item validities on the ba- 
sis of correlational techniques, the simpler and less efficient tech- 
niques have too often been used. This is particularly unfortunate in 
those cases where the number of cases is small and has often resulted 
in retaining items purely on the basis of chance differences. In an 
effort to reduce the time required for the more accurate types of item 
validation, various tables, nomographs, and abacs have been devel- 
oped. 
One of the earliest of these techniques was a table developed by 
Edgerton and Paterson (1) which gives standard errors from .001 
to .100 by successive steps of .001 for all possible combinations of
differences between percentages. For a maximum percentage differ- 
ence of 50, this table covers a maximum standard error for 25 cases
and a minimum standard error for one million cases. These data do 
not give item validity, but do permit necessary computations to be 
made more readily than otherwise. 

Votaw (8) has published the formulas necessary to construct an 
abac based on the probable error of differences between two groups 
as well as an abac based on a critical ratio of 2 (in terms of probable 
error) when the N of each subgroup equals 45. Mosier and Mc- 
Quitty (7) have published a more detailed abac giving critical ratios 
of 2, 3, 5, 7 and 10 (in terms of standard error) based on the highest 
and lowest fifty cases of a group having a total N of 200. Mosier and 
McQuitty also published correlational abacs based on upper-lower 
halves and upper-lower quarters. Guilford (4) has published an abac 
based on phi coefficients ranging from 0 to +.90 in steps of .10 and 
another giving 1% and 5% levels of significance when the total N 
equals 50, 100, 200, and 400. 

Lord (6) has given an alignment chart for calculating the four- 
fold point correlation coefficient, the chart being entered on three 
values: per cent of cases successful with respect to the first variable, 
the similar per cent for the second variable, and the per cent of cases 
successful with respect to both variables. 

Fiske and Dunlap (2) have published a formula for construct- 
ing an ellipse on the assumption that the two sub-groups are random 
samples from the same parent population and that the best estimate 
of the true proportion is the weighted mean proportion of the two 
samples. Fiske and Dunlap’s abac is based on a critical ratio of 2 
with one hundred persons in each sub-group. 

Numerous other abacs and nomographs have been constructed. 
Although they all possess the advantage of reducing statistical com- 
putations, other objections are inherent in their use. Abacs and no- 
mographs based on critical ratios, level of significance, or chi square 
are necessarily constructed for use in situations having a specified 
number of cases in each sub-group. As the N is changed, so must the 
abac be changed also. Inasmuch as construction of an abac requires 
computing numerous points by means of formula and careful draw- 
ing of a curved line to fit these points, and because the number of 
abacs which can be drawn is unlimited, this procedure becomes im-
practical. 

Rather than constructing an abac for each possible N, it is also
possible to devise abacs for various selected N’s, and in any single 
study the abac can be used which most closely approximates the avail- 
able data. Although such approximations are sufficiently accurate 
for most practical purposes, many research workers are reluctant to 
state in the literature that their work is based on approximations
lest such statement be interpreted as an indication of flatulent work. 

Another possible procedure is to discard data to the extent that 
the N fits one of the available abacs. This procedure may be satis- 
factory when data are based on several hundred cases but cannot be 
recommended when the original N is small. Necessity frequently dic- 
tates a small N, and it is inadvisable to further reduce the N in or- 
der to permit quicker analysis of data by means of an available abac. 

Another difficulty inherent in abacs is the difficulty of accurately 
determining exact item validity when interpolations must be made 
between the ellipses of an abac. The error of such estimates may be 
further increased by the necessity for interpolation at the points of 
entry on the abac. 

Some of the objections to nomographs and abacs can be over- 
come by using a table which is entered in a similar manner; namely, 
the proportion of one sub-group responding in a specified way sup- 
plies the vertical entry and the proportion of another sub-group re- 
sponding in the same way is entered in the horizontal dimension. 
Exact item validity can then be read at the point of intersection of 
the column and row. 

In order to be of maximum value, a table of this type should be 
expressed in terms which permit its use with any desired number of 
cases and should be such as to permit rapid determination of the degree 
of significance of any difference. The table of phi coefficients accom- 
panying this article fulfills these conditions. The following interrela- 
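tionships obtain when the number of cases in the two sub-groups is
equal:

$$\text{Phi Coefficient } (\phi) = \sqrt{\frac{\chi^2}{N}} = \frac{CR}{\sqrt{N}}; \tag{1}$$

$$\text{Critical Ratio } (CR) = \phi\sqrt{N} = \sqrt{\chi^2}; \tag{2}$$

$$\text{Chi Square } (\chi^2) = N\phi^2 = CR^2. \tag{3}$$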
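
Relations (1) to (3) can be applied directly to a coefficient read from the table. The sketch below assumes N is the total number of cases in the two equal sub-groups; the function names are hypothetical.

```python
import math


def critical_ratio(phi, n_total):
    """Relation (2): CR = phi * sqrt(N)."""
    return phi * math.sqrt(n_total)


def chi_square(phi, n_total):
    """Relation (3): chi square = N * phi**2 = CR**2."""
    return n_total * phi ** 2


def phi_from_chi_square(chi2, n_total):
    """Relation (1): phi = sqrt(chi square / N)."""
    return math.sqrt(chi2 / n_total)


# Hypothetical case: phi of .30 read from the table, 100 cases in all.
phi, n = 0.30, 100
print(critical_ratio(phi, n), chi_square(phi, n))
```

Because the table itself does not depend on N, the same entry serves for any number of cases; only this final conversion changes.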


A Pearson r corresponding to the phi coefficient can be estimated
if desired. If both variables can be considered as being continuous,
the Pearson r is estimated by dividing $\phi$ by .637. If one set of data
are genuinely dichotomous and the other is continuous but artificially
reduced to a dichotomy, the Pearson r can be estimated by dividing $\phi$
by .798, although, as Guilford (3) points out, the meaning of such
figures is questionable and interpretation should be made only with
extreme caution and with cognizance of the steps by which the coefficient
was derived. If true point distributions are involved, $\phi$ is
numerically equivalent to the Pearson r.

In addition to being readily converted to CR and $\chi^2$, the phi co-
efficient has the added advantages of being widely applicable in many 
situations, and being one of the few coefficients which can properly 
be used in some of these situations. 

Assuming that the sub-groups are equal in size, the formula for
$\phi$ is usually expressed as:

$$\phi = \frac{p_u - p_l}{\sqrt{pq}} \tag{4}$$

where $p_u$ = the proportion in terms of total N of one sub-group (upper)
responding in a specified way,
$p_l$ = the proportion of the other sub-group (lower) responding
in the same way,
$p$ = the total proportion ($p_u + p_l$) responding in the specified
way,
$q$ = the total not responding in the specified way (1.00 − $p$).

In item analysis the test constructor usually deals with proportions
expressed in terms of each sub-group rather than the total N.
In such case $p + q = 2.00$. The formula can then be expressed as:

$$\phi = \frac{\dfrac{p_u - p_l}{2}}{\sqrt{\dfrac{pq}{2 \cdot 2}}} \tag{5}$$

where $p = p_u + p_l$ and $q = 2 - (p_u + p_l)$.
Formula (5) can be expressed entirely in terms of $p_u$ and $p_l$ as:

$$\phi = \frac{p_u - p_l}{\sqrt{(p_u + p_l)(2 - p_u - p_l)}}. \tag{6}$$

Formula (6) was used in constructing the accompanying table.
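
Formula (6) is short enough to evaluate directly. The following is a minimal sketch, assuming both proportions are computed within their own sub-group and the sub-groups are equal in size; the function name is hypothetical, and returning zero when the two proportions are equal is a convention adopted here for the sketch.

```python
import math


def phi_equal_groups(p_u, p_l):
    """Phi coefficient by formula (6) for two equal-sized sub-groups.

    p_u, p_l: proportion of the upper and lower sub-group responding
    in the specified way, each computed within its own sub-group.
    """
    if p_u == p_l:
        return 0.0  # also covers p_u = p_l = 0 or 1, where (6) is 0/0
    return (p_u - p_l) / math.sqrt((p_u + p_l) * (2 - p_u - p_l))


# Entries that can be compared with the table (three decimal places).
print(round(phi_equal_groups(1.00, 0.00), 3))
print(round(phi_equal_groups(0.50, 0.00), 3))
# Reversing the two entries changes only the sign.
print(round(phi_equal_groups(0.00, 0.50), 3))
```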
The 200 coefficients on the outer edges of the table were computed
by formula to be accurate to six decimal places. The 4850 remaining
coefficients were obtained by successive subtraction of a constant from
each of the outer-edge entries. The constant results from the fact
that $p_u + p_l$ remains the same and $p_u - p_l$ decreases systematically
by .02 on any diagonal from the outer edge which is perpendicular
to the center diagonal which separates the table into positive and
negative coefficients. For example: Select a $p_u$ of .71 and $p_l$ of .00,
which appears on the outer edge of the table. On a diagonal perpendicular
to the center dividing line, $p_u$ and $p_l$ progressively become .70
and .01, .69 and .02, .68 and .03, .67 and .04, etc. In each case $p_u + p_l$
= .71, and $p_u - p_l$ progressively decreases by .02 from .71 to .69, .67,
.65, .63, etc. The subtracted constant was therefore obtained by multiplying
the reciprocal of the denominator of formula (6) by .02. Successive
subtractions from the outer-edge phi coefficient computed by
formula thus gave each of the other entries in the diagonal.
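
The diagonal construction described above can be sketched as follows. The sketch is restricted, as an assumption, to diagonals that start on the outer edge where $p_l$ = .00 (so that $p_u + p_l$ does not exceed 1.00); the function name is hypothetical.

```python
import math


def diagonal_entries(p_sum, steps):
    """Entries along one diagonal, starting at p_u = p_sum, p_l = .00.

    Along the diagonal p_u + p_l stays equal to p_sum, so the
    denominator of formula (6) is fixed and each step subtracts
    the same constant, .02 divided by that denominator.
    """
    denominator = math.sqrt(p_sum * (2 - p_sum))
    edge = p_sum / denominator      # outer-edge entry by formula (6)
    constant = 0.02 / denominator   # the subtracted constant
    return [edge - k * constant for k in range(steps)]


# The example from the text: .71 and .00, .70 and .01, .69 and .02, ...
for k, value in enumerate(diagonal_entries(0.71, 5)):
    print(f"p_u={0.71 - 0.01 * k:.2f}  p_l={0.01 * k:.2f}  phi={value:.3f}")
```

Each value agrees with direct evaluation of formula (6) at the corresponding pair of proportions, which is what makes the subtraction shortcut valid.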

Several checks were made to insure accuracy of the table: (1) 
Each of the two hundred outer-edge phi coefficients was separately 
computed by two persons, (2) each of the two hundred subtractive 
constants was computed separately by two persons, (3) original phi 
coefficients and subtractive constants were checked on the basis of 
comparable quadrants of the table, and (4) each row and each col- 
umn of the final table were checked separately by two persons on the 
basis of comparable quadrants. 

The table of phi coefficients is simple to use. The proportion of 
each of two sub-groups responding to an item in a specified manner 
is determined, the proportions being computed on the basis of the 
number of cases within each sub-group rather than the number of 
cases within the total group. The table is entered vertically by one 
of the proportions and horizontally by the other. The phi coefficient 
is given at the point of intersection of the row and column entries. 
For purposes of reducing the size of the table, only the positive co- 
efficients are included. If entry in the usual manner leads to a blank 
cell in the table, the coefficient can be found by reversing the hori- 
zontal and the vertical entry. The manner of entry together with the 
type of data being handled readily indicates whether the coefficient 
is positive or negative. 

The table assumes an equal number of cases in the two sub- 
groups, and if this assumption is met the coefficient is the same as a 
coefficient computed by formula. The greater the difference between 
the N’s of the two sub-groups the greater will the table coefficient dif- 
fer from the computed coefficient. 
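
The effect of unequal sub-groups can be seen by comparing formula (6) with the phi coefficient computed from the fourfold frequencies. The sketch below assumes the usual fourfold-table expression for phi; the function names and the frequencies are hypothetical.

```python
import math


def phi_from_counts(a, b, c, d):
    """Phi from fourfold frequencies.

    a, b: upper sub-group responding / not responding in the given way
    c, d: lower sub-group responding / not responding in the given way
    """
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator


def phi_table_value(p_u, p_l):
    """Formula (6): the value the table gives for within-group proportions."""
    return (p_u - p_l) / math.sqrt((p_u + p_l) * (2 - p_u - p_l))


# Same proportions (.60 and .20) with equal and with unequal sub-groups.
equal = phi_from_counts(30, 20, 10, 40)    # 50 and 50 cases
unequal = phi_from_counts(48, 32, 4, 16)   # 80 and 20 cases
table = phi_table_value(0.60, 0.20)
print(equal, unequal, table)
```

With equal sub-groups the two computations agree; with the 80 and 20 split the computed coefficient falls noticeably below the table value.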

Knowledge on the part of the user of the number of cases in- 
cluded in the groups will permit a quick determination of critical ra- 
tio, level of significance, or chi square as given by formulas (1), (2), 
and (3). The user can then set his own standards for accepting or 
rejecting items, and for assigning weights to items. 

Inasmuch as this table permits rapid and accurate determination 
of item validity on the basis of statistically rigorous techniques, there 
is no justification for using less efficient methods. Such inefficient 
methods now require as much time as the more acceptable, but hither- 
to time-consuming methods. 


REFERENCES 
1. Edgerton, H. A. and Paterson, D. G. Table of standard errors and probable
errors of percentages for varying numbers of cases. J. appl. Psychol., 1926,
10, 378-391.

2. Fiske, D. W. and Dunlap, J. W. A graphical test for the significance of
differences between frequencies from different samples. Psychometrika, 1945,
10, 225-227.

3. Guilford, J. P. Fundamental statistics in psychology and education. New
York: McGraw-Hill, 1942, p. 247.

4. Guilford, J. P. The phi coefficient and chi square as indices of item validity.
Psychometrika, 1941, 6, 11-19.

5. Long, J. A. and Sandiford, P. The validation of test items. Bulletin No. 3,
Department of Educational Research. Toronto: University of Toronto, 1935.
Pp. 126.

6. Lord, F. M. Alignment chart for calculating the fourfold point correlation
coefficient. Psychometrika, 1944, 9, 41-42.

7. Mosier, C. I. and McQuitty, J. V. Methods of item validation and abacs for
item-test correlation and critical ratio of upper-lower differences.
Psychometrika, 1940, 5, 57-65.

8. Votaw, D. F. Graphical determination of probable errors in validation of
test items. J. educ. Psychol., 1933, 24, 682-686.







00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19


100 1000 990 980 970 961 951 942 932 923 914 905 895 886 877 869 860 851 842 834 825 
99 990 980 970 960 950 941 931 922 912 903 894 884 875 866 857 848 839 831 822 813
98 980 970 960 950 940 930 921 911 902 892 883 874 864 855 846 837 828 819 810 802 
97 970 960 950 940 930 920 910 901 891 882 872 863 853 844 835 826 817 808 799 790 
96 961 950 940 930 920 910 900 890 881 871 862 852 843 833 824 815 806 797 788 779 
95 951 941 930 920 910 900 890 880 870 861 851 842 832 823 813 804 795 786 777 768 
94 942 931 921 910 900 890 880 870 860 850 841 831 821 812 803 793 784 775 766 756 
93 932 922 911 901 890 880 870 860 850 840 830 821 811 801 792 783 773 764 755 745 
92 923 912 902 891 881 870 860 850 840 830 820 810 801 791 781 772 762 753 744 734 
91 914 903 892 882 871 861 850 840 830 820 810 800 790 781 771 761 752 742 733 724 
90 905 894 883 872 862 851 841 830 820 810 800 790 780 770 761 751 741 732 722 713


89 895 884 874 863 852 842 831 821 810 800 790 780 770 760 750 741 731 721 712 702 
88 886 875 864 853 843 832 821 811 801 790 780 770 760 750 740 730 721 711 701 692 
87 877 866 855 844 833 823 812 801 791 781 770 760 750 740 730 720 710 701 691 681 
86 869 857 846 835 824 813 803 792 781 771 761 750 740 730 720 710 700 690 681 671 
85 860 848 837 826 815 804 793 783 772 761 751 741 730 720 710 700 690 680 670 661 
84 851 839 828 817 806 795 784 773 762 752 741 731 721 710 700 690 680 670 660 650 
83 842 831 819 808 797 786 775 764 753 742 732 721 711 701 690 680 670 660 650 640 
82 834 822 810 799 788 777 766 755 744 733 722 712 701 691 681 670 660 650 640 630 
81 825 813 802 790 779 768 756 745 734 724 713 702 692 681 671 661 650 640 630 620 
80 816 805 793 781 770 759 747 736 725 714 704 693 682 672 661 651 641 630 620 610 


79 808 796 784 773 761 750 738 727 716 705 694 683 673 662 652 641 631 620 610 600 
78 800 788 776 764 752 741 729 718 707 696 685 674 663 653 642 632 621 611 600 590 
77 791 779 767 755 744 732 720 709 698 687 676 665 654 643 633 622 612 601 591 580 
76 783 771 759 747 735 723 712 700 689 678 667 656 645 634 623 612 602 591 581 571 
75 775 762 750 738 726 714 703 691 680 669 657 646 635 625 614 603 592 582 571 561
74 766 754 742 730 718 706 694 682 671 660 648 637 626 615 604 594 583 572 562 551 
73 758 746 733 721 709 697 685 674 662 651 639 628 617 606 595 584 573 563 552 542 
72 750 737 725 713 700 688 677 665 653 642 630 619 608 597 586 575 564 553 543 532 
71 742 729 717 704 692 680 668 656 644 633 621 610 599 588 577 566 555 544 533 523 
70 734 721 708 696 684 671 659 647 636 624 612 601 590 578 567 556 545 535 524 513 


69 726 713 700 688 675 663 651 639 627 615 603 592 581 569 558 547 536 525 514 504 
68 718 705 692 679 667 654 642 630 618 606 595 583 572 560 549 538 527 516 505 494 
67 710 697 684 671 658 646 634 621 609 597 586 574 563 551 540 529 518 507 496 485 
66 702 689 676 663 650 637 625 613 601 589 577 565 554 542 531 519 508 497 486 475 
65 694 681 667 654 642 629 616 604 592 580 568 556 545 533 522 510 499 488 477 466 
64 686 673 659 646 633 621 608 596 583 571 559 547 536 524 513 501 490 479 468 457
63 678 665 651 638 625 612 600 587 575 563 550 539 527 515 503 492 481 469 458 447
62 670 657 643 630 617 604 591 579 566 554 542 530 518 506 494 483 472 460 449 438 
61 662 649 635 622 608 595 583 570 557 545 533 521 509 497 485 474 462 451 440 429 
60 655 641 627 614 600 587 574 561 549 536 524 512 500 488 476 465 453 442 431 419 


59 647 633 619 605 592 579 566 553 540 528 515 503 491 479 467 456 444 433 421 410 
58 639 625 611 597 584 570 557 544 532 519 507 494 482 470 458 447 435 423 412 401
57 631 617 603 589 576 562 549 536 523 510 498 486 473 461 449 438 426 414 403 391
56 624 609 595 581 567 554 541 527 514 502 489 477 464 452 440 428 417 405 394 382 
55 616 601 587 573 559 546 532 519 506 493 480 468 456 443 431 419 408 396 384 373
54 608 593 579 565 551 537 524 510 497 484 472 459 447 434 422 410 398 387 375 364 
53 600 586 571 557 543 529 515 502 489 476 463 450 438 425 413 401 389 377 366 354 
52 593 578 563 549 535 521 507 493 480 467 454 441 429 416 404 392 380 368 356 345
51 585 570 555 541 526 512 498 485 471 458 445 432 420 407 395 383 371 359 347 335 
50 577 562 547 532 518 504 490 476 463 450 436 424 411 398 386 374 362 350 338 326 


00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19


[Table of phi coefficients, continued (columns 00-19, rows 49-00; columns 20-39): entries not recoverable in this copy.]
185 
176 
166 
156 
146 
136 
126 
116 


29 


30 31 32 33 


184 178 
174 163 
164 153 
154 144 
144 134 
134 124 
124 114 
114 104 
104 093 
094 083 073 


084 073 063 
074 063 
063 053 042 
053 042 032 
043 032 021 
032 021 011 
032 921 011 0 
022 011 0 

011 0 

0 


194 
185 
175 
165 
155 
145 
135 





34 35 36 37 


152 142 131 121 
142 132 122 111 
132 122 112 101 
122 112 102 091 
113 102 092 081 
103 092 082 071 
092 082 072 061 
082 072 062 051 
072 062 051 041 
062 052 041 031 


052 041 031 021 


031 021 010 0 
021 010 0 

011 0 

0 


388 39 
111 101 


010 0 


052 042 031 021 010 0 


30 31 32 33 34 35 36 87 88 389 























— 


SRRRRRee Seeeeeesess 


40 41 
655 647 


120 110 





C. E. JURGENSEN 


42 43 44 45 46 47 48 49 


639 631 624 616 608 600 593 585 
625 617 609 601 593 586 578 570 
611 603 595 587 579 571 563 555 
597 589 581 573 565 557 549 541 
584 576 567 559 551 543 535 526 
570 562 554 546 537 529 521 512 
557 549 541 532 524 515 507 498 
544 586 527 519 510 502 493 485 
532 523 514 506 497 489 480 471 
519 510 502 493 484 476 467 458 
507 498 489 480 472 463 454 445 


494 486 477 468 459 450 441 432 
482 473 464 456 447 438 429 420 
470 461 452 443 434 425 416 407 
458 449 440 431 422 413 404 395 
447 438 428 419 410 401 392 383 
435 426 417 408 398 389 380 371 
423 414 405 396 387 377 368 359 


190 180 170 160 150 140 131 121 
180 170 160 150 140 130 120 110 


170 160 150 140 130 120 110 100 
160 150 140 130 120 110 100 090 
150 140 130 120 110 100 090 080 
140 130 120 110 100 090 080 070 
130 120 110 100 090 080 070 060 
120 110 100 090 080 070 060 050 
110 100 090 080 070 060 050 040 
100 090 080 070 060 050 040 030 


110 100 090 080 070 060 050 040 030 020 
100 090 080 070 060 050 040 030 020 010 


40 41 


42 438 44 45 46 47 48 49 


50 51 


- 577 570 


562 554 
547 539 
532 524 
518 510 
504 496 
490 482 
476 468 
463 454 
450 441 
436 428 


424 415 
411 402 
398 385 
386 377 
374 364 
362 352 
350 340 
338 328 


326 317 307 


314 305 


303 294 
292 282 
280 271 
269 260 


27 


52 53 54 55 56 57 58 59 


562 554 547 539 531 523 516 508 
546 539 531 523 515 507 499 491 
581 523 515 507 499 491 483 475 
516 508 500 492 483 475 467 459 
502 493 485 477 468 460 451 443 
487 479 470 462 453 445 436 428 
473 464 456 447 439 430 421 418 
459 450 442 433 424 416 407 398 
445 437 428 419 410 402 393 384 
432 423 414 405 397 388 379 370 
419 410 401 392 383 374 365 356 


406 397 388 379 370 360 351 342 
393 384 375 366 356 347 338 329 
380 371 362 353 343 334 325 315 
368 358 349 340 331 321 312 302 
355 346 337 327 318 309 299 290 
343 334 324 315 306 296 286 277 
331 322 312 303 293 284 274 264 
319 310 300 291 281 271 262 252 
298 288 279 269 259 250 240 
296 286 276 267 257 248 238 228 


284 274 265 255 246 236 226 216 
273 263 253 244 234 224 214 205 
261 252 242 232 222 213 208 193 
250 240 231 221 211 201 191 181 


258 249 239 229 219 210 200 190 180 170 


247 238 
236 227 
226 216 
215 205 
204 194 


194 184 
183 178 
173 163 


111 101 
100 091 


090 080 
080 070 
070 060 


228 218 208 199 189 179 169 159 
207 197 188 178 168 158 148 
206 196 186 177 167 157 147 1387 
195 185 176 166 156 146 136 126 
185 175 165 155 145 135 125 115 


174 164 154 144 134 124 114 104 
163 153 144 134 124 114 104 098 
153 148 133 123 113 103 093 083 
142 132 122 113 103 092 082 072 
132 122 112 102 092 082 072 062 
122 112 102 092 082 072 062 051 
111 101 091 081 071 061 051 041 

091 081 071 061 051 041 0381 
091 081 071 061 051 041 031 020 
081 071 061 051 041 030 020 010 


070 060 050 040 080 020 010 0 
060 050 040 030 020 010 0 
050 040 030 020 010 0 


060 050 040 030 020 010 0 


050 040 
040 030 


080 020 010 0 
020 010 0 


030 020 010 0 


020 010 
010 0 
0 


50 51 


0 


52 58 54 55 56 57 58 59 








28 


4G 


091 
081 
071 
J61 
051 
041 


41 


080 
070 
060 
050 
040 
030 


42 
070 


43 
060 


44 
050 


45 
040 


46 


0380 


PSYCHOMETRIKA 


4T 
020 


48 
010 


060 050 040 030 020 010 0 
040 030 020 010 0 


050 
040 
030 
020 


030 020 010 


020 
010 
0 


375 


360 & 


346 


333 8 


319 
306 
295 
286 
267 
255 
242 
230 
218 


206 


010 
0 


41 


010 


0 


198 
186 


163 
151 
140 
129 
117 
10€ 
095 
084 


074 
063 
052 
042 
031 
021 


030 
020 
010 
0 


074 


042 
031 
021 
010 


010 0 


0 


020 
010 
0 


064 


053 
042 
032 
021 
010 
0 


010 
0 


168 


043 
032 
021 
011 


0 


46 


66 


453 


208 


158 
146 


122 
110 


099 


076 
065 


043 


032 
021 
011 
0 


67 


445 
426 
408 
390 


are 
373 


357 
341 
325 
310 
295 
280 


266 


065 


48 


68 


436 
418 
399 


382 
364 


348 


331 : 
=< 


$15 < 


300 
285 
270 


256 


055 


49 
0 


49 


428 
391 


151 
139 
126 


114 
102 
090 
078 
067 
055 
044 


054 044 033 


043 
032 


033 
022 


022 
011 


021 011 0 


011 
0 


0 


60 61 62 63 64 65 66 67 68 69 


50 


103 
091 
079 
068 
056 
045 
033 
922 
O11 
0 


7 71 72 73 74 75 76 77 78 19 


51 52 


51 52 


092 081 
080 069 
068 057 
057 046 
045 034 


538 54 55 5 


083 


070 
058 
046 
084 
023 


54 55 56 57 


74 75 6 7 


387 
366 
346 
327 
308 


071 


378 
357 
337 
317 
298 
280 
262 
245 


060 


369 360 
348 339 
327 317 
307 297 
288 278 
270 259 
252 241 
235 224 
218 207 
202 191 
186 175 


171 160 
156 145 
142 130 
127 116 
114 102 


164 152 


148 136 
133 121 
118 106 
104 092 
090 078 





100 088 076 064 | 


087 075 
074) 062 
061 049 
048 037 


063 051 
050 038 
037 025 
025 012 


059 048 036 024 012 0 
047 035 024 012 0 
035 023 012 0 
023 012 0 


011 


034 023 011 0 


022 011 
011 0 
0 


0 


0 





8 





Bo O44 
310 300 
288 277 
266 256 
246 235 
227 215 
208 197 
190 178 
173 161 
156 144 
140 128 


124 112 
109 097 
094 082 
080 067 
066 053 
952 039 
039 026 
025 018 
013 0 


i 0 





80 81 


82 83 84 85 86 


vi4 
290 
267 
245 
224 
204 
185 
166 
149 
132 
115 


099 
084 
069 
055 
040 
027 
018 
0 


82 


305 
280 
256 
233 
212 
192 
172 


295 
269 
245 
222 
200 
179 
160 


285 2 
258 2 


74 
47 


233 221 
210 197 
188 175 


83 84 85 


86 


87 


264 
235 
209 
184 
161 
140 
119 
100 
082 
064 
047 


031 


015 
0 


87 


88 89 
253 241 


223 211 
196 183 


049 038 
032 016 


016 0 
0 


88 89 





C. E. JURGENSEN 29 


90 91 92 93 94 95 96 97 98 99 100 


229 217 204 190 176 160 148 123 101 071 0 
197 184 169 153 186 117 096 071 041 0 

168 154 138 121 102 082 059 052 0 

142 126 110 082 072 051 027 0 

118 191 984 066 046 024 0 

095 078 061 042 022 0 

074 057 039 020 0 

054 037 019 0 

85 018 0 


017 0 
9 


90 91 92 98 94 95 96 97 98 99 100 




















PSYCHOMETRIKA—VOL. 12, No. 1 
MARCH, 1947 


THE USE OF PSYCHOLOGICAL TECHNIQUES IN MEASURING 
AND CRITICALLY ANALYZING NAVIGATORS’ 
FLIGHT PERFORMANCE 


LAUNOR F. CARTER 
UNIVERSITY OF ROCHESTER 
AND 


FRANK J. DUDEK* 
UNIVERSITY OF SOUTHERN CALIFORNIA 


Under controlled flight conditions, the distance between a 
navigator’s report of position and his actual position is a criterion 
of success in dead reckoning navigation. Students’ logs were eval- 
uated for five separate missions by comparing the students’ entries 
with standards determined by experts. The reliability of this tech- 
nique is indicated by the fact that mission to mission intercorre- 
lations of error scores were low, while the intercorrelations be- 
tween legs of the same mission were moderately high. The inter- 
correlations between the error scores for the different navigation 
variables were computed and analyzed by using both factor analysis 
and multiple regression techniques. Both analyses indicated that 
a major portion of all dead reckoning error could be attributed to 
errors made in determining magnetic deviation. As a result of 
these analyses, recommendations were made for changing the in- 
struction in dead reckoning and alterations in the equipment used 
were suggested. 


1. The Problem 


Since one of the most difficult tasks in psychology is the objective 
evaluation of complex skills, one of the most difficult tasks of the 
aviation psychologist is the objective evaluation of flight performance. 
It is the purpose of this paper to discuss a method for evaluating navi- 
gators’ dead reckoning proficiency as an example of the difficulties 
encountered in assessing aerial performance and also to describe a 
method by which a successful evaluative technique may be used to 
analyze performance critically and to lead to its improvement. 

At first glance it would seem that the evaluation of dead reckon- 
ing would be simple and straightforward. The navigator’s plane de- 
parts from a known position; by determining the true air speed, the 
wind, and the true course flown, the navigator calculates his position 
for a particular time without recourse to check points on the ground. 
By comparing the navigator’s calculated position with his actual po- 
sition, it is possible to obtain an immediate, objective measure of the 
accuracy of his work. In actual practice this can reasonably be done 

* This work was done as a part of the AAF Aviation Psychology Program. 


The authors are indebted to Thomas Paltier, Harley Smith, Wolcott, Lyon, and 
John King for assistance with the calculations. 




for the navigators in any one plane, but as soon as it is desired to 
compare the performance of navigators from plane to plane, it is 
found that the results are not comparable due to a multitude of un- 
controlled flight conditions. Even though two planes depart from the 
same point for the same destination, it is not proper to evaluate the 
navigators’ relative skill on the basis of their objective results in terms 
of position error unless most careful controls have been instituted to 
assure similar flight conditions for both planes. The development of 
an objective scale for evaluating navigation skill involved setting up 
conditions in which all the navigators would be exposed to the same 
situation in the air, perform the same kind of navigation, and have 
their performances evaluated in terms of a precise standard and in 
an objective and equitable manner. At the same time, the technique 
used had to insure that the missions were flown safely, represented an 
efficient expenditure of plane time, and were consistent with the other 
operational problems encountered at a training school. In addition, an 
adequate technique had to be reliable and reproducible. 
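
The dead reckoning computation and the position-error criterion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: it assumes a flat-earth approximation, positions in miles east and north of the departure point, and a wind named by the direction it blows from. The function names `dead_reckon` and `distance_off` are hypothetical.

```python
import math

def dead_reckon(start, true_heading_deg, true_air_speed,
                wind_from_deg, wind_speed, hours):
    """Return the (east, north) position after flying for `hours`.

    Assumptions (not from the paper): flat earth, constant heading,
    constant wind, speeds in miles per hour.
    """
    hdg = math.radians(true_heading_deg)
    wnd = math.radians(wind_from_deg)
    # Velocity through the air, resolved into east and north components.
    air_e = true_air_speed * math.sin(hdg)
    air_n = true_air_speed * math.cos(hdg)
    # The wind is named by where it comes FROM, so it pushes the other way.
    wind_e = -wind_speed * math.sin(wnd)
    wind_n = -wind_speed * math.cos(wnd)
    return (start[0] + (air_e + wind_e) * hours,
            start[1] + (air_n + wind_n) * hours)

def distance_off(reported, actual):
    """Straight-line distance between reported and actual positions."""
    return math.hypot(reported[0] - actual[0], reported[1] - actual[1])

# A student who ignores a 20 mph wind from the west on an eastbound leg.
student = dead_reckon((0.0, 0.0), 90.0, 150.0, 0.0, 0.0, 1.0)
actual = dead_reckon((0.0, 0.0), 90.0, 150.0, 270.0, 20.0, 1.0)
error_miles = distance_off(student, actual)
```

In this hypothetical case the whole position error comes from the wind estimate, which is the kind of attribution the later sections attempt statistically.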


2. Description of the Technique 
The technique employed was to fly the navigation missions in 
formation and to require the navigators to perform “follow-the-pilot” 
navigation, that is, dead reckoning navigation in which the navigator 
simply determines the position of the plane at specified times but does 
not direct the pilot regarding the course he is to fly. By flying forma- 
tions in which the planes were within a few yards of each other, all 
the students were flown over the same course, at the same speed, and 
under the same weather conditions. Several different types of formations
and planes were used during this experiment. 


Each mission consisted of four legs with each leg covering ap- 
proximately 100 miles and terminating at selected turning points. 
At each turning point the formation changed course by approximately 
90°. The turns were made consistently in one direction so that a some- 
what square track was followed. Flying the missions over a square 
track yielded three advantages: (1) it was possible to return to the 
place of departure and thus secure maximal use of student and plane 
time; (2) by turning 90° at each turning point, the students were pro- 
vided with an opportunity to obtain the best estimate of wind on two 
headings; and (3) four legs of 100 miles each made possible the collec- 
tion of adequate, comparable, and independent samples of navigation 
on each leg. At the same time that the students were required to deter- 
mine their position at each of the four turning points by dead reckon- 
ing, expert lead navigators in the first plane of the formation were di- 
recting the course flown and keeping a precise record of the actual 










track made good over the ground. While the student navigators had to 
rely only on their calculations, the lead navigators actually looked at the 
ground and carefully determined where they were at all times. Other 
personnel in the planes made very accurate readings of the air speed 
meter, the compass, the drift meter, and the astro-compass. In this 
way extremely accurate standards were developed with which the 
students’ log entries could be compared. Since the students all navi- 
gated under the same flight conditions and objective standards were 
available for evaluating their performance, it was believed that an 
adequate technique had been developed. 


3. The Reliability of the Technique 

In October and November of 1944, approximately 180 students 
were flown on three different flight missions, and again during the 
summer of 1945, 80 students were flown on four different flight mis- 
sions spaced throughout the navigation course. From the data collected 
on these missions, it was possible to analyze the reliability of the 
technique. Two types of analysis were possible, namely, an intra-mis- 
sion reliability which corresponds to a split-half reliability and an in- 
ter-mission reliability which corresponds to a test-retest reliability. 

Since the missions were flown by legs, there were four indepen- 
dent evaluations of each navigator’s performance. The correlation be- 
tween the errors made on the first and fourth legs and the errors 
made on the second and third legs were computed for a number of 
navigation variables; however, only the reliabilities for true air speed 


TABLE 1 

Within Missions Reliability Coefficients for True Air Speed 
(Legs 1-4 With Legs 2-3) 

                     Mission 5     Mission 6     Mission 7    All Missions
Flight  Formation    N    Rho      N    Rho      N    Rho     Combined r
23A        1         28   .88      23   .56      22   .76        .78
           2         19   .25      18   .75      1[?] .86        .58
23B        1         23   .59      23   .84      20   .21        .48
           2         20   .25      20   .12      20   .61        .57
24A        1         21   .51      21   .68      24   .75        .68
           2         17   .94      17   .57      21   [?]        .70
24B        1         21   .56      21   .64      24   .68        .63
           2         17   .45      18   .3[?]    20   .48        .42

[Combined r row: two entries read .61; the remaining entries are illegible.]













will be discussed at this point since the conclusions drawn from the 
true air speed reliabilities are similar to those drawn from the re- 
mainder of the data. Table 1 shows the correlation between the error 
scores of the first and fourth legs and those of the second and third 
legs for the first group of students. 

Since the error scores obtained were not normally distributed 
and it was felt that extreme scores should not bias the reliabilities, 
rank order correlations rather than Pearsonian correlations were cal- 
culated. However, the combined correlations were obtained by con- 
verting the rho’s to r’s and combining them by the z transformation 
and reconverting to r’s. From Table 1 it will be seen that while there 
is considerable variability from formation to formation, the reliabili- 
ties are fairly high, the Spearman-Brown correction giving a relia- 
bility of .77. But from examining the data, it was suspected that this 
reliability was unduly high due to constant factors within particular 
planes; thus, if the air speed meter in one plane were improperly 
calibrated, all of the students within that plane would make high 
error scores which would not be indicative of their true ability. If 
there were no systematic factors within planes, it would be expected 
that the correlation between the error scores of students within the 
same planes would be 0. Table 2 shows the correlation of the error 
scores of students in the same planes by seats. In view of the magni- 
tude of these correlations, it was thought probable that systematic 
plane differences accounted for some of the reliability shown in Table 1. 
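
The combining procedure described above (rho to r, averaging through the z transformation, and the Spearman-Brown step-up) can be sketched as follows. This is a minimal illustration, not the authors' worksheet: it assumes Pearson's conversion r = 2 sin(πρ/6) for the rho-to-r step and weights each z by N − 3, neither of which is stated in the text. The function names are hypothetical.

```python
import math

def rho_to_r(rho):
    """Convert a rank correlation to a product-moment r (Pearson's formula)."""
    return 2.0 * math.sin(math.pi * rho / 6.0)

def combine_rhos(samples):
    """Combine (n, rho) pairs into one r through Fisher's z transformation.

    Each z is weighted by n - 3 (an assumption; the paper does not give
    its weights). A rho of exactly 1 or -1 cannot be transformed.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for n, rho in samples:
        z = math.atanh(rho_to_r(rho))
        weighted_sum += (n - 3) * z
        total_weight += n - 3
    return math.tanh(weighted_sum / total_weight)

def spearman_brown(r_half):
    """Step a half-length reliability up to full length."""
    return 2.0 * r_half / (1.0 + r_half)

# Hypothetical inputs: two formations with the same rank correlation.
combined = combine_rhos([(20, 0.50), (30, 0.50)])
# A half-length correlation of .63 (hypothetical) steps up to about .77.
stepped_up = spearman_brown(0.63)
```

When every formation has the same rho, the combined value is simply the converted r, which is a convenient check on the arithmetic.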


TABLE 2 
Correlation Coefficients 
Seats for True Air Speed on Mission 5 

            Seats 1 and 2    Seats 2 and 3    Seats 1 and 3
Flight       N     Rho        N     Rho        N     Rho
23A         14     .36       13     .54       13     .22
23B         14     .20       13    -.12       14    -.11
24A         10     .10       10     .49       11     .34
24B         11     .21       12     .24       10     .65
Combined r         .25              .30              .28





These results indicated that a more appropriate reliability might be 
obtained by analysis of covariance. After the variances associated with 
plane and seat were removed, a coefficient of .48 was obtained for the 
intra-mission reliability of the individual’s scores. 

Another estimate of the technique’s reliability was obtained by 







comparing performance on successive missions. The correlations between 
error scores on different missions were computed, and the results are 
shown in Table 3. 














TABLE 3 
Between Missions Correlation Coefficients for True Air Speed 

                     Mission 5 vs. 6   Mission 6 vs. 7   Mission 5 vs. 7
Flight  Formation      N     Rho         N     Rho         N     Rho
23A        1          23    -.13        22     .04        22    -.08
           2          18     .03        17    -.27        17    -.16
23B        1          23    -.08        20    -.35        20    -.06
           2          20    -.05        20     .00        20     .28
24A        1          19    -.61        21     [?]        21     .14
           2          15     .17        17     .13        17    -.19
24B        1          18     .45        21    -.10        21     .01
           2          15    -.34        18     .08        17    -.11
Combined r                  -.01               .00               .00





This complete lack of relationship raises serious question regarding 
the adequacy of the technique. However, since the experiments took 
place at a time when the learning curve was steep, the test-retest relia- 
bility coefficient may have been somewhat attenuated. There were 
probably other uncontrolled variables, such as day-to-day fluctuations 
in the planes, the relative skill of pilots, and variable weather con- 
ditions. It is, of course, also possible that the technique of measure- 
ment is sufficiently reliable, but that the navigation task is so diffi- 
cult and so influenced by chance factors that consistent test-retest re- 
sults could not be expected except for a great number of missions. If 
this is the case, it raises serious doubt as to the possibility of ever 
practically achieving an adequate objective evaluation of navigators’ 
performance. It should be remembered that each student navigated 
over twelve hours in furnishing the data on which these reliabilities 
are based. 

Since these reliabilities are based on data collected from the first 
application of the technique and since somewhat unstable AT-7 planes 
seating only three students were used, it was decided to apply the 
technique to a second sample of 80 cadets at Ellington Field where 
larger and more stable C-47’s, seating eight navigators, were available. 
The Ellington Field students were flown on missions in their seventh, 
eleventh, sixteenth, and twentieth weeks of training. Table 4 shows the 
correlation between successive missions and between the total errors 










on missions 1 and 4 against those on missions 2 and 3. In addition to 
correlations for true air speed, the correlations for deviation, drift, 
and distance-off are shown. 


TABLE 4 
Between Missions Correlation Coefficients from the 
Second Series of Missions 

                   Missions   Missions   Missions   Missions
Variable           1 and 2    2 and 3    3 and 4    1 + 4 and 2 + 3
Drift               [?]        [?]       -.01        .19
Deviation           [?]        .13       -.03        .09
True Air Speed      .27        .18        .13        .27
Distance-Off       -.10        .13        .01       -.03





Certainly, the intercorrelations in this table are not high enough to 
suggest that this technique, even though internally reliable, will show 
consistent positive correlations when administered from time to time. 

Similar correlations have been reported for other measures of 
complex flying skill. The following data have been reported by Psycho- 
logical Research Project (Pilot), Randolph Field, Texas, in their Re- 
port for the Fiscal Year 1945. Table 5 shows the reliability for data 
collected in measuring the performance of elementary pilot students 
in landing. 


TABLE 5 

Reliability of Measures of Elementary Pilot 
Students’ Landing Ability 

                      Ground vs.          1st Landing vs.     1st Day Landings vs.
                      Air Observer        2nd Landing on      2nd Day Landings
Measures              Trial 1  Trial 2    Day 1    Day 2      Landing 1  Landing 2
Zone in which 
  Plane Lands          [?]*     .86       .14†     .11         .02†       .01
Landing Attitude       .87      .68       .45      .18        -.07        .15
Bounced or 
  Dropped              .89      .87       .44      .19        -.01        .01

* The N for these six correlations equals 152. 
† The N for these six correlations equals 170. 


Again it will be noted that the reliability of ratings made for any par- 
ticular mission is high while the reliability of ratings made for diff- 
erent missions on different days tends to be low. Without offering fur- 
ther data, it may be ventured that most measures of flight perform- 
ance for the navigator, pilot, gunner, and bombardier will tend to 













show low inter-mission correlations even though intra-mission relia- 
bilities may be fairly high. 


4. Use of the Technique in Critically Analyzing Performance 

Whenever it is possible to measure the component parts of any 
performance, an objective assessment of the relative importance of 
the several kinds of operator errors may be undertaken. On logical 
grounds it was possible to attribute poor performance to a number of 
different types of error in navigation; but it was not possible to deter- 
mine objectively the relative importance of these errors until some 
technique had been developed to measure the accuracy both of over-all 
performance and of each of the steps of navigation. The technique pre- 
viously described made it possible to determine the particular opera- 
tions responsible for most of the navigator’s error. By correlating the 
error scores for each navigation operation, a correlation matrix for 
the error scores was constructed. In analyzing this matrix, either by 
factor analysis techniques or by multiple regression techniques, it was 
possible to determine the major causes of navigation error and their 
relative importance. Tables 6 and 7 show the correlations between the 
error scores for the different navigation variables for three sets of 
data. 


TABLE 6 
Intercorrelations Between Errors for 8 Navigation Variables for 165 Students 
Who Were in Their 7th Week of Training at Selman Field 

                 Track   Drift   DEV     TAS     WF      WD      GS      DO    |  MEAN     SD
Track                    .11     .7[?]  -.0[?]   .12     .17     .06     .78   |  18.40    8.04
Drift            .11             .01     .00     .72     .27     .[?]2   .12   |   5.10    8.35
Deviation        .7[?]   .01            -.02     .0[?]   .17     .11     .67   |  18.90    8.20
True Air Speed  -.0[?]   .00    -.02             .04    -.18     .28     .12   |   8.20    6.65
Wind Force       .12     .72     .0[?]   .04             .12     .57     .18   |  10.30    8.10
Wind Direction   .17     .27     .17    -.18     .12             .27     .18   | 135.00   87.00
Ground Speed     .06     .[?]2   .11     .28     .57     .27             .30   |  19.20    9.76
Distance-Off     .78     .12     .67     .12     .18     .18     .30           |  31.40   14.80

[Digits shown as [?] are illegible or read differently in the two halves of the matrix.]





It should be noted that Table 7 shows two sets of data, one set above 
the diagonal and the other below. Table 8 shows the rotated factor 
loadings and communalities for the three sets of correlations shown 
in Tables 6 and 7. 











TABLE 7 
Intercorrelation Between Errors for 14 Navigation Variables for Two* Groups of 40 Students Each in Their 7th Week of Training 

[Table body not recoverable: the table was printed sideways and its entries are illegible in this extraction. Rows and columns: Track, Drift, True Heading, Magnetic Heading, Deviation, Indicated Air Speed, Calibrated Air Speed, True Air Speed, Wind Force, Wind Direction, Ground Speed, Distance Traveled, Time Traveled, and Distance-Off, followed by the mean and SD for each of the two groups.] 

* Note that there are two sets of intercorrelations in this table, one above and the other below the diagonal. 





TABLE 8 
Rotated Factor Loadings for Three Different Missions 

[Table body not recoverable: the table was printed sideways and its entries are illegible in this extraction. Column groups: 165 Students in 7th Week at Selman; 40 Students in 7th Week at Ellington; 40 Other Students in 7th Week at Ellington, each giving loadings on the extracted factors and the communality. Rows (navigation variables): Track, Drift, True Heading, Magnetic Heading, Deviation, Indicated Air Speed, Calibrated Air Speed, True Air Speed, Wind Force, Wind Direction, Ground Speed, Distance Traveled, Time Traveled, and Distance-Off.] 

* The variables for which no loadings appear were not included in this analysis. 
















These factor loadings were obtained by the centroid method of factor 
analysis. In these analyses the variable, distance-off, may be con- 
sidered as a criterion since the types of errors are being sought which 
contribute most to over-all inaccuracy in dead reckoning navigation. 
In each analysis in Table 8, the first factor contains the heaviest dis- 
tance-off loading and thus the variables defining this factor will be 
those which are responsible for the largest part of dead reckoning 
error. It will be noted that deviation has a high loading in the first 
factor in each of the analyses. Since deviation and drift are the two in- 
dependent variables making up track, it is apparent that errors in 
deviation are the major cause of error in dead reckoning navi- 
gation. (Drift and deviation are the two independent variables which 
when applied to compass heading determine track; errors in the other 
“heading variables” such as magnetic heading, true heading, and track 
are directly attributable to errors in deviation or drift.) 
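
The way an error in deviation carries through the heading variables can be sketched as follows. This is a minimal illustration under assumed conventions that the paper does not state: corrections are added with easterly (right) values positive, magnetic variation is treated as a known chart value, and the lateral offset uses a simple flat-earth sine rule. The function names are hypothetical.

```python
import math

def track_from_compass(compass_heading, deviation, drift, variation=0.0):
    """Apply deviation, variation, and drift in turn (degrees)."""
    magnetic_heading = (compass_heading + deviation) % 360.0
    true_heading = (magnetic_heading + variation) % 360.0
    track = (true_heading + drift) % 360.0
    return magnetic_heading, true_heading, track

def lateral_offset(leg_miles, angle_error_deg):
    """Sideways displacement after flying a leg with a constant track error."""
    return leg_miles * math.sin(math.radians(angle_error_deg))

# Standard values versus a student who misjudges deviation by 3 degrees.
standard = track_from_compass(90.0, 2.0, 5.0)
student = track_from_compass(90.0, 5.0, 5.0)
# The same 3 degrees appears in magnetic heading, true heading, and track.
errors = [s - t for s, t in zip(student, standard)]
offset_miles = lateral_offset(100.0, 3.0)
```

Under these assumptions a 3° deviation error shifts every later heading variable by 3° and puts the computed position a little over 5 miles to one side at the end of a 100-mile leg.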

The second factor in each analysis has a high loading in drift and 
also a high loading in one of the wind variables; but it has only mod- 
erate or low factor loadings on the criterion, distance-off. Similarly, the 
third factor has its highest loadings in one of the speed variables, 
ground speed or air speed; and this factor also has only moderate or 
low loadings in the criterion. The fourth factor, in the cases when it 
has been extracted, may be a specific factor associated with the par- 
ticular mission involved. 

From this analysis two important points stand out. First, in all 
the analyses the first factor is most clearly attributable to error in 
the determination of deviation and it is also the most important fac- 
tor in accounting for navigation inaccuracies. The second point is that 
in each analysis three factors, each identifiable by the same loadings 
from analysis to analysis, are clearly identifiable. 

It may be suggested that in each analysis there are certain load- 
ings for one of the factors which do not appear in other analyses. This 
is only to be expected. The psychologist should not think of these mis- 
sions as similar to carefully controlled testing situations since, as has 
been previously mentioned, the actual testing situation changed from 
mission to mission. The wind forces were considerably different on 
each mission; the type of plane and the type of formation used at Sel- 
man Field differed from those used at Ellington Field. Thus it is sur- 
prising that the results were as consistent as they are. Another reason 
for these fluctuations is the small number of cases in the two Elling- 
ton Field groups. They were not analyzed together since for the first 
flight the average wind force was between 10 and 15 knots while for 
the second flight it was between 20 and 25 knots. The influence of the 













number of cases may be noticed in the intercorrelations and factor 
loadings for drift, wind force, and wind direction. In the first Elling- 
ton Field analysis three students obtained reciprocal winds by plotting 
negative drifts when they should have been positive and vice versa. 
The result was that there were no errors in wind force to correspond 
to the large discrepancies between actual drift readings and the stu- 
dents’ estimates, but the error in wind direction was at a maximum. 
Thus, the correlation between errors for wind force and drift was low, 
while the correlation between errors for wind direction and drift was 
high. Even in spite of these errors the over-all factorial composition 
is quite consistent. 

To check and extend the above results, the correlation matrices 
presented in Tables 6 and 7 and the results from three later missions 
were analyzed by the use of multiple regression techniques. The variable, 
distance-off, was considered the dependent variable (the criterion), and 
beta weights were determined for the other variables. Thus the highest beta 
weights would indicate those variables whose errors contributed the 
most to distance-off in dead reckoning navigation. Table 9 shows the 
beta weights and multiple correlations obtained when distance-off is 
predicted by seven navigation variables, and also when it is predicted 
by the three completely independent and basic variables. 
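
The beta weights described above can be obtained from a correlation matrix alone. The following is a minimal sketch, not the authors' computing routine: it assumes the usual normal equations for standardized variables, and the function names and the illustrative correlations are hypothetical rather than taken from Tables 6 and 7.

```python
import math

def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        # Move the row with the largest pivot into place.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def beta_weights(R_xx, r_xy):
    """Beta weights and multiple correlation from correlations only.

    R_xx: correlations among the predictors (square list of lists).
    r_xy: correlation of each predictor with the criterion.
    """
    betas = solve_linear(R_xx, r_xy)          # R_xx @ betas = r_xy
    r_squared = sum(b * r for b, r in zip(betas, r_xy))
    return betas, math.sqrt(r_squared)

# Made-up correlations for three predictors of a criterion.
R_xx = [[1.0, 0.2, 0.1],
        [0.2, 1.0, 0.3],
        [0.1, 0.3, 1.0]]
r_xy = [0.3, 0.7, 0.4]
betas, multiple_R = beta_weights(R_xx, r_xy)
```

With uncorrelated predictors the beta weights reduce to the criterion correlations themselves, which gives a quick check on the routine.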


TABLE 9
Beta Weights and Multiple Correlations Obtained for Predicting Distance-Off
from Seven Navigational Variables and from the Three
Basic Navigational Variables

Group                    Selman   Ellington
Week in Training         7th      7th      11th     16th     20th

Seven variables
  Track                  .67      .53      .84      .17      .61
  Drift                  .00      .09      .00      .02      .00
  Deviation              .13      .28      .00      .20      .05
  True Air Speed         .08      .00      .00      .24      .07
  Wind Force             .00      .00      .00      .16      .00
  Wind Direction         .00      .00      .16      .05      .07
  Ground Speed           .23      [illegible]  [illegible]  .24  .39
  Multiple R             .83      .87      [illegible]  .99  .85

Three basic variables
  Drift                  [illegible]  .25  .15      .36      .09
  Deviation              .67      .71      .60      .76      .41
  True Air Speed         .13      .01      -.02     .10      .29
  Multiple R             .69      .17      .68      .84      .49

It will be seen that the results of the factor analyses are confirmed and
that deviation is again the most important of the variables (remem-
bering that errors in track are a function of errors in deviation). The
relative importance of errors in true air speed and in drift seems to be
about the same, but both are considerably less important than errors
in deviation.

On the basis of these results, it was possible to recommend to the 
personnel responsible for navigation training that the amount of in- 
struction and practice on the different techniques for determining 
deviation be increased. The results also indicated that the instruments 
used for determining deviation might be faulty and the consideration 
of these instruments led to the recommendation of several changes in 
the astro-compass to improve the accuracy with which it could be used. 
It will be noted that in studying a complex skill, it is first necessary 
that a technique be developed for assessing over-all performance of 
the skill and also for assessing the separate components determining 
this performance. Once these measures have been developed, it is
possible to determine statistically the importance of the various parts 
of the task. Such an analysis can then be used as a basis for recommend- 
ing changes in courses of instruction or as a point of departure for 
examining the different parts of the task to determine those aspects 
of performance which can be most profitably improved. 

5. Summary 

A technique for assessing dead reckoning navigation performance 
was developed. The technique consisted of flying a large number of 
students simultaneously in formation over the same course. The re- 
liability of this technique was found to be fairly high within any one 
mission, but to be practically zero when measured by the correlation 
between missions. Considering these reliabilities and others collected 
in measuring pilots' landing ability, it is concluded that in many com-
plex skills reliability for any particular trial may be high and yet the
correlation between trials, which corresponds to test-retest reliability, 
may be low. 

Once the technique for assessing performance had been developed, 
it was possible to determine the principal cause of error in dead 
reckoning performance by the application of factor analysis and mul- 
tiple regression techniques. On the basis of these results, it was pos- 
sible to make recommendations, based on objective evidence, regarding 
changes in instruction and improvement of the instruments used in
navigation.













ANALYSIS IN TERMS OF FREQUENCIES OF DIFFERENCES 


HAROLD A. VOSS 
THE PROCTER AND GAMBLE COMPANY


A technique of analysis utilizing frequencies of differences is 
described and applied to a hypothetical experiment involving two 
methods of instruction. A nomograph is provided for computing the 
chi-square values applicable to the method. 


On several occasions, the author has had the problem of compar- 
ing the effectiveness of two training methods, or aids, where a num- 
ber of experimental conditions were involved. In such case, the inves- 
tigator may employ analysis of variance or the conventional tech- 
niques for evaluating the reliability of differences. However, either 
time pressure or the preliminary nature of the investigation may 
make it advisable to short cut the lengthy computations involved in 
these methods. The method to be described is believed to fulfill the 
need for such a short cut. 

Briefly, the method involves tabulation of the paired measures 
under the varied conditions of the experiment. The measures may 
be scores of individuals, means, standard deviations, percentages, or 
correlation coefficients, depending on the measuring instrument and 
other circumstances of the experiment. A second table is then pre- 
pared from the first, the entries being the frequency of differences 
for each condition favoring the one method over the other. For each 
experimental condition $\chi^2$ is then computed to determine if the dis-
tribution of differences differs significantly from the expected dis-
tribution. If the two methods were equally effective, the expected
distribution of differences would center about zero with an equal 
number of differences favoring each method. 

The hypothetical data in Table 1 have been prepared to illustrate 
the method. Two methods of instruction in puzzle solving are com- 
pared, demonstration alone and demonstration plus explanation. Six 
groups, each consisting of a subgroup instructed by one method and 
a subgroup instructed by the other method, have been selected in 
random fashion from a population of school children relatively homo- 
geneous in age, intelligence, and school grades. For each group 
the schedule of instruction and test is the same. On Days 1 and 2 a
period of instruction is followed by a test on three puzzles and two
weeks later, on Day 15, there is another test on three puzzles but no
instruction. Use of rewards and differences in puzzle content are
varied systematically over the groups. To summarize, puzzle solv-
ing under two types of instruction, demonstration and demonstration
plus explanation, is compared over the following conditions:

    Motivation:      reward used
                     reward not used
    Test:            after first instruction period
                     after second instruction period
                     two weeks after second instruction period
    Puzzle content:  numbers
                     geometric patterns
                     symbols other than numbers
    Puzzle type:     one step
                     two step
                     three step

[TABLE 1. Puzzle Solving under Conditions of Demonstration and of
Demonstration Plus Explanation (hypothetical data). Entry is the mean
time in seconds for a group of 10. Columns: Groups 1 to 6 (numbers,
patterns, and symbols, each without and with reward), each with a Dem
and a Dem Exp subgroup. Rows: Days 1, 2, and 15, each with one-, two-,
and three-step puzzles. The tabled values are not legible in this copy.]

Each pair of entries in Table 1 is marked with a plus (+) if 
the difference favors the demonstration plus explanation group. Zero 
(0) indicates no difference, and minus (—) indicates a difference in 
favor of the demonstration group. Table 2 presents the tabulation 
of the frequency of + and — differences for each experimental con- 
dition. Zero differences give a credit of .5 to each method. For each 
subcategory of the experimental conditions, chi square has been com- 
puted to determine whether the observed distribution of differences 
departs significantly from the expected distribution. As was pointed 
out previously, if neither method is superior, an equal number of 
differences can be expected to favor each method in accordance with 
the familiar null hypothesis. 

In the chi square formula,

$$\chi^2 = \sum \left[ (f_o - f_e)^2 / f_e \right],$$

where

    $f_o$ = observed frequency; the values used in the present case are
            the frequency of plus and of minus differences in turn;

    $f_e$ = expected frequency; the value used is $n/2$.
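
The computation just described can be sketched in a few lines. This is an illustrative sketch only; the function name and argument names are assumptions, and zero differences are credited .5 to each method as stated in the text.

```python
def frequency_chi_square(plus, minus, zeros=0):
    """Chi square (1 df) for frequencies of differences.

    plus, minus: counts of differences favoring each method.
    zeros: count of zero differences, credited .5 to each method.
    """
    f_plus = plus + 0.5 * zeros
    f_minus = minus + 0.5 * zeros
    n = f_plus + f_minus
    expected = n / 2                      # f_e under the null hypothesis
    return sum((f - expected) ** 2 / expected for f in (f_plus, f_minus))

# Frequencies as they appear in Table 2
reward = frequency_chi_square(22, 5)      # about 10.70
total = frequency_chi_square(39, 15)      # about 10.67
```

Applied to the Reward row of Table 2 (22 plus, 5 minus) the function gives about 10.70, and applied to the total experiment (39 plus, 15 minus) about 10.67.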


TABLE 2

Analysis in Terms of Frequencies of Differences Favoring One
Method of Instruction over the Other

(Based on hypothetical data from Table 1)

Experimental         n     f_o (+)    f_o (-)    f_e      chi sq.*   P
condition                  Dem Exp    Dem        (n/2)    df=1
Motivation
  Reward            27     22.0        5.0       13.5     10.70      <.01
  No reward         27     17.0       10.0       13.5      1.81      .20-.10
Test
  Day 1             18     11.5        6.5        9.0      1.39      .30-.20
  Day 2             18     14.0        4.0        9.0      5.56      .02-.01
  Day 15            18     13.5        4.5        9.0      4.50      .05-.02
Puzzle content
  Numbers           18      8.0       10.0        9.0       .22      .70-.50
  Patterns          18     15.0        3.0        9.0      8.00      <.01
  Symbols           18     16.0        2.0        9.0     10.89      <.01
Puzzle type
  1 step            18     10.5        7.5        9.0       .50      .50-.30
  2 step            18     14.0        4.0        9.0      5.56      .02-.01
  3 step            18     14.5        3.5        9.0      6.72      <.01
Total experiment    54     39.0       15.0       27.0     10.67      <.01

* $\chi^2 = \sum [(f_o - f_e)^2 / f_e]$.

The analysis in Table 2 shows that demonstration plus explana-
tion is a more effective method of instruction in puzzle solving than
demonstration alone. The $\chi^2$ and corresponding P values provide the
investigator with sufficient material to draw conclusions about the 
differential effects of the two methods of instruction under the varied 
conditions of motivation, recall, and retention, puzzle content, and 
type. Since the purpose here is to describe a method of analysis and 
not to draw conclusions from hypothetical data, the reader will be 
spared the discussion outlined. However, it should be noted in pass- 
ing that the results of the analysis appear in a concise, understand- 
able form. 

Table 3 presents an analysis of the same data in terms of the 
reliability of mean differences employing the conventional t tech- 
nique. Table 3 strongly substantiates Table 2 with the greatest dif- 
ference being the somewhat higher confidence levels reflected in the 
P values of Table 3. A comparison in terms of the usual interpreta- 
tion of P values (.01 highly significant, .05 significant; > .05 not sig- 
nificant), indicates that in six cases out of twelve the interpretation 
would be the same, in five cases there would be a one-step difference, 
and in one case a two-step difference. The similarity of the two meth-
ods is further indicated by the rank correlation coefficient between
chi-square and t of .91. If the square of this value may be taken to
indicate the amount of overlap in the two methods of analysis, then
the frequency of difference analysis is approximately eighty per cent
as effective as the t test. It would be pertinent to note at this point
that the hypothetical data were set up systematically and were not
juggled in any way later to produce greater correspondence.

TABLE 3

Analysis in Terms of Mean Differences between Two Methods of Instruction
(Based on hypothetical data from Table 1)

Experimental        M_1      M_2       M_1 - M_2   sigma_d   sigma_d/sqrt(N-1)   t*     df   P
condition           Dem      Dem Exp
Motivation
  Reward           29.10    26.91      2.19        2.45      .48                 4.56   26   <.01
  No reward        29.76    28.75      1.01        1.77      .35                 2.89   26   <.01
Test
  Day 1            31.53    30.27      1.26        2.35      .57                 2.21   17   .05-.02
  Day 2            28.42    26.34      2.08        2.33      .57                 3.65   17   <.01
  Day 15           28.32    26.88      1.44        1.86      .45                 3.20   17   <.01
Puzzle content
  Numbers          25.51    25.36       .15        1.59      .39                  .38   17   .80-.70
  Patterns         28.82    26.27      2.55        2.38      .58                 4.40   17   <.01
  Symbols          33.94    31.87      2.07        1.83      .44                 4.70   17   <.01
Puzzle type
  1 step           18.21    17.72       .49         .94      .23                 2.13   17   .05-.02
  2 step           25.46    24.26      1.20        1.69      .41                 2.93   17   <.01
  3 step           44.62    41.51      3.11        2.71      .66                 4.71   17   <.01
Total experiment   29.43    27.83      1.60        2.22      .30                 5.33   53   <.01

* $t = \dfrac{M_1 - M_2}{\sigma_d / \sqrt{N - 1}}$.
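
The t values of Table 3 can be checked against the footnote formula. The sketch below is only an illustration under that formula; the function name is hypothetical, and the tabled t values appear to have been computed from the standard error after rounding it to two places, so an unrounded computation may differ slightly in the second decimal.

```python
import math

def t_from_differences(mean_diff, sd_diff, n_pairs):
    """t for a mean difference: (M_1 - M_2) / (sigma_d / sqrt(N - 1)).

    Returns the t value and the standard error of the mean difference.
    """
    standard_error = sd_diff / math.sqrt(n_pairs - 1)
    return mean_diff / standard_error, standard_error

# Reward row of Table 3: mean difference 2.19, sigma_d 2.45, N = 27 pairs
t_reward, se_reward = t_from_differences(2.19, 2.45, 27)
```

For the Reward row this gives a standard error of about .48 and a t of about 4.56, with N - 1 = 26 degrees of freedom.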

Whatever sacrifice in precision is entailed, the saving in com- 
puting time is considerable. The frequency of difference analysis 
reported here was done in less than thirty minutes without the aid of a 
calculating machine, whereas the mean difference analysis required 
about three hours with a calculator. As regards the alternate technique 
of analysis of variance, one thing is certain — it would have taken far 
more time. An even greater saving of time may be accomplished with 
the aid of the nomograph here provided. It is designed to determine the 
value of chi square when there is one degree of freedom and $f_e = n/2$.
The nomograph includes a table of P values corresponding to chi 
square. 

The reader will readily visualize additional applications of the 
frequency of differences method. It is applicable wherever comparative
measures of effectiveness are available for methods, instruments, aids,
chemical compounds, etc. As was noted previously, the measures of 
effectiveness may be scores on tests or other measuring instruments, 
time scores, or statistics based upon individual scores. Some experi- 
mental applications may take advantage of the fact that the technique 
takes into account only the frequency and direction of the differences 
and disregards the magnitude of the differences. 

Whatever precision is lost is due to this disregard of magnitude. 
However, this disadvantage is offset by a number of advantages which 
may be listed as follows: 


1. The method is extremely rapid, involving little com- 
putation and hence little opportunity for error. 

2. It has a wide range of applications, particularly in 
the field of research on training methods and aids. 

3. It yields results which are readily understandable
and can be explained to those unfamiliar with
statistical methods.

4. It involves no terminology or concepts outside the
realm of conventional statistics.


[Nomograph for determining chi square with one degree of freedom and
$f_e = N/2$, with scales for $p_o$, N, $\chi^2$, and P. The scale markings
are not recoverable from this copy.]

$$\chi^2 = \sum \left[ (f_o - f_e)^2 / f_e \right]$$

where
    $f_o$ = frequency observed;
    $f_e$ = frequency expected, in this case $f_e = N/2$.

DIRECTIONS. Divide the larger value of $f_o$ by N. Find this value on
the $p_o$ scale and the value of N on the N scale. Connect the two values
with a ruler and read the value of $\chi^2$ and the corresponding value of P
where the ruler cuts the scales so marked.

EXAMPLE. $f_o$ = 8 and 2, N = 10. Divide 8 by 10 and connect the resulting
value, .8, on the $p_o$ scale with 10 on the N scale. The ruler cuts the
$\chi^2$ scale at 3.6 and the P scale at .06 approximately.

NOTE. The nomograph may be used to determine chi-square only when there
is one degree of freedom (two cells) and $f_e = N/2$.
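
What the nomograph reads off graphically can also be computed directly. With two cells and $f_e = N/2$, the chi-square formula reduces algebraically to $N(2p_o - 1)^2$, where $p_o$ is the larger frequency divided by N. The sketch below assumes that reduction and takes P from the normal curve through the standard library's `math.erfc`; the function names are hypothetical and are not part of the published nomograph.

```python
import math

def nomograph_chi_square(larger_f, n):
    """Chi square (1 df, f_e = N/2) from the larger observed frequency."""
    p_o = larger_f / n                    # value found on the p_o scale
    return n * (2 * p_o - 1) ** 2

def p_value_df1(chi_sq):
    """Probability of a chi square this large or larger with 1 df."""
    return math.erfc(math.sqrt(chi_sq / 2))

# The example in the directions: frequencies 8 and 2, N = 10
chi_sq = nomograph_chi_square(8, 10)      # about 3.6
p = p_value_df1(chi_sq)                   # about .06
```

For the example in the directions this gives a chi square of about 3.6 and a P of about .06, in agreement with the readings quoted there.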










AN INDEX OF ITEM VALIDITY PROVIDING A CORRECTION
FOR CHANCE SUCCESS 


A. P. JOHNSON 
PURDUE UNIVERSITY 


The KG Index described below is proposed for evaluation as one 
approach to the problem of providing an index giving comparable 
values for items (1) of equal discriminative power at all levels of 
difficulty and (2) of different numbers of alternative responses. 


1. The KG Index 
In 1934 Votaw* suggested that item validity comparisons for 
upper and lower 27% groups of a tested population sample be based 
on the proportion in each group who know the answer to an item 
rather than on the proportion who answer it correctly. 
He gave the general equation:

$$x = \frac{nR - N}{n - 1},$$

in which

    $x$† = the number in each group who know the correct answer,
    $n$ = the number of choices in the item,
    $R$ = the number of correct responses to the item within a given group,
    $N$ = the number of cases in each group.


Votaw found in some instances that values of x were nega- 
tive. He construed these to mean that either the items in ques- 
tion were (1) so stated as to “trick” examinees into making incor- 
rect responses, (2) keyed incorrectly, or (3) suffering from some 
other serious fault rendering them invalid. 

* Votaw, D. F. Notes on validation of test items by comparison of widely
spaced groups. J. educ. Psychol., 1934, 25, 185-191.

† The writer uses hereafter the symbol K for the number of the total test
population who are estimated to know the correct answer.

Without presenting the generalized ratio, he shows that for any
group all of whom respond to any item and who are completely in
ignorance of the correct answer N/n of them would be expected to
mark the correct answer by chance. An experimental study by Rug- 
gles of the marking by a student group of the correct choice in a 
true-false test the material of which was wholly unfamiliar to them 
is mentioned by Lee and Symonds.* The percentage of correct re- 
sponses actually obtained was 51% when the chance expectation was 
50%. Votaw shows that the probability of any invalidating condi- 
tions existing in an item can be determined for any given negative 
value of x for that item by considering N/n plus or minus its prob-
able error. He reported the P.E. of N/n to be equal to $.6745\sqrt{Npq}$,
where $p = 1/n$ and $q = (1 - 1/n)$.


In 1936, Guilford† proposed as a means of evaluating the level
of difficulty of a test item the proportion passing corrected by an
allowance for chance success. This corrected proportion,
${}_cp = (nR - N)/[N(n - 1)]$, is simply the proportion of the total test
population who may be expected to know the correct answer. In short,
based on the total R for the item it is K (or Votaw’s x for the entire
test population) divided by N.

Thus

$$K = \frac{nR - N}{n - 1}. \qquad (1)$$
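
Formula (1) and the corrected proportion can be sketched as follows. The function names are hypothetical and serve only to illustrate the arithmetic; no omissions are assumed, as in the text.

```python
def estimated_knowers(n_choices, right, n_cases):
    """Formula (1): K = (nR - N) / (n - 1), assuming no omissions."""
    return (n_choices * right - n_cases) / (n_choices - 1)

def corrected_proportion(n_choices, right, n_cases):
    """K divided by N: proportion estimated to know the answer."""
    return estimated_knowers(n_choices, right, n_cases) / n_cases

# A five-choice item answered correctly by 200 of 500 examinees
K = estimated_knowers(5, 200, 500)            # 125.0
proportion = corrected_proportion(5, 200, 500)  # 0.25
```

When the number of right answers equals the chance expectation N/n, the estimate of K is zero; when it falls below N/n, the estimate is negative, which is the case Votaw discusses.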





It is proposed that there be considered a new index, the KG In- 
dex, based not on contrasted upper and lower groups of 25%, 27%, 
or 33%, but on contrasted upper and lower groups equal in size for 
each item to the number of individuals estimated to know the correct 
response to that item. Suppose for a given item that 


    R = 200,
    W = 300,
    N = 500,
    n = 5.

Then

$$K = \frac{nR - N}{n - 1} = \frac{5 \times 200 - 500}{4} = \frac{500}{4} = 125.$$
It is postulated that if those 125 persons who are estimated to know
the correct answer to the given item are the 125 persons who are
highest on the criterion scale, that item may be said to have a perfect
positive relationship to the criterion. If those 125 who were esti-
mated to know the correct answer are the 125 persons who are lowest
on the criterion scale, that item may be said to have a perfect nega-
tive relationship to the criterion. The extent of this relationship
could be determined by arranging the 500 papers in decreasing or-
der of criterion scores and determining for that given item how many
correct responses occurred among the top 125 papers. The more
nearly that item approached perfect positive relationship with the
criterion the more nearly the number of correct responses among
the top 125 papers would, in this instance, approach 125. If this
item had a perfect negative relationship to the criterion scores, all
125 who were estimated to know the correct answer would appear
among the lowest 125 on the criterion scale. The remaining number
of correct answers, $R - K$, 200 - 125 or 75, would be distributed
among the 500 - 125 or 375 papers remaining. Chance expectations
with 75 rights among 375 papers would be 1 in 5 throughout the
upper 375 papers on the criterion scale. Thus the upper 125 papers
would be expected to include about 125/5 or 25 correct responses by
chance.

* Lee, J. M. and Symonds, P. M. New type or objective tests: A summary
of recent investigations (October 1931-1933). J. educ. Psychol., 1934, 25, 161-

† Guilford, J. P. The determination of item difficulty when chance success is
a factor. Psychometrika, 1936, 1, 259-264.

The KG Index can be developed as follows as a convenient sum- 
mary of how closely the actual responses approximate the perfect 
(or deviate from the chance) relationship with the criterion scores, 
as that relationship is defined above. 

Let us use the symbols $R_U$ for the number of right responses in
the upper group and $R_L$ for the number of right responses in the
lower group. In the perfect positive relationship postulated above
the ratio $R_U/K$ should equal 1.0, and the ratio $R_L/K$ should equal
$1/n$ (since $R_L$ should equal $K/n$).*

In a chance relationship both $R_U/K$ and $R_L/K$ should equal $1/n$.†

In a perfect negative relationship as postulated above, the ratio
$R_U/K$ should equal $1/n$‡ (since $R_U$ should equal $K/n$), and the ratio
$R_L/K$ should equal 1.0.

* As Votaw indicates, the value N/n or in this instance K/n may well vary,
as estimated by the formula S.E. of $K/n = \sqrt{K \times 1/n \times (n - 1)/n}$. According
to the probability tables for the normal curve of error, in 9973 cases out of 10000,
K/n should not vary beyond $\pm 3\sqrt{K \times 1/n \times (n - 1)/n}$. It is thus possible that
values of less than 1/n may occur.

† Idem.

‡ Idem.

It is possible to obtain a positive index for so-called positive re-
lationship, a zero index for chance relationship and a negative index
for the so-called negative relationship by subtracting the ratios $R_U/K$
and $R_L/K$. This difference of proportions, $(R_U - R_L)/K$, is essentially
the same as the U-L Index proposed by the writer. The two are iden-
tical when the value of K is .27N. The difference between $(R_U - R_L)/K$
and the U-L Index is that K is not fixed at .27N but may vary from
0 to N. Whenever K exceeds N/2, however, the upper and lower K
groups overlap by the amount $2(K - N/2)$ or $2K - N$. For the sake
of simplicity, the practical question of how to handle omitted re-
sponses is deferred until later. If the group considered to be guess-
ing the correct answer (i.e., those not knowing it) is designated by
G, then $N - K = G$. If, when K is greater than N/2, the G group
rather than the K group is made the basis for upper vs. lower group
comparison, the resulting difference between actual correct responses
in upper and lower groups is the same and the groups do not over-
lap. The divisor (G) then is no greater than the effective maximum
value of the groups whose numbers of correct responses are sub-
tracted. The symbol B can be substituted for K in the ratio
$(R_U - R_L)/K$, where B (i.e., the base group) is K when K does not
exceed N/2, and G when K exceeds N/2.

Since for perfect positive relationship as defined above the ex-
pected value of $(R_U - R_L)/B$ is $1 - 1/n$ or $(n - 1)/n$ and since for per-
fect negative relationship the expected value of $(R_U - R_L)/B$ is
$1/n - 1$, the ratio in this form has a maximum theoretical value
dependent on the number of choices. This dependence can be elimi-
nated by multiplying the ratio $(R_U - R_L)/B$ by $n/(n - 1)$. This expres-
sion is the KG Index:

$$\text{KG Index} = \frac{n(R_U - R_L)}{(n - 1)B}. \qquad (2)$$
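
Formula (2), together with the rule for choosing the base group B, can be sketched as below. The function names are hypothetical. The limiting cases use the five-choice example given earlier (K = 125, N = 500), with the lower or upper group scoring at the postulated K/n.

```python
def base_group(K, N):
    """Return B: K when K does not exceed N/2, otherwise G = N - K."""
    return K if K <= N / 2 else N - K

def kg_index(n_choices, right_upper, right_lower, base):
    """Formula (2): n(R_U - R_L) / ((n - 1) B)."""
    return n_choices * (right_upper - right_lower) / ((n_choices - 1) * base)

B = base_group(125, 500)                      # 125
perfect_positive = kg_index(5, 125, 25, B)    # R_U = K, R_L = K/n
chance = kg_index(5, 25, 25, B)               # R_U = R_L
perfect_negative = kg_index(5, 25, 125, B)    # R_U = K/n, R_L = K
```

The multiplier n/(n - 1) is what brings the two postulated extremes to +1 and -1 regardless of the number of choices.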


2. Computation of the KG Index 

As a first step, the test papers are arranged from highest to 
lowest on the criterion scale to be used as the standard of validity. 
The number of persons marking the correct response and the number 
marking all incorrect responses in the upper 30% of papers and in 
the lower 30% of papers is determined by graphic item count or other 
means. In order to provide most efficiently for data needed later, it 
is suggested that the papers be divided into successive groups as fol- 
lows: upper and lower 6%, next upper and lower 4%, next upper and 
lower 5%, next upper and lower 5% and the remaining upper and 
lower 10% to total 30% in each.* With the graphic item counter, for 
instance, each upper group is run through separately and a separate 
count obtained on each. The same procedure is followed with the 
lower groups. For example, by adding the data of the 6% and the
next 4% groups the information necessary for computing the KG
Index based on the upper and lower 10% groups for certain specific
items can be obtained. Similarly the data for any desired base groups
from 6% to 30% can be found for each item.

* The use of these suggested groups is believed to provide sufficient accuracy
while avoiding the necessity of computing different base groups for each item.

The average number of correct and the average number of in-
correct responses in the upper and lower 30% groups provide a prac-
ticable, if not a precise, means of determining the difficulty level of
each item by the formula:

$$K = R - \frac{W}{n - 1}. \qquad (3)$$

Divided by .3 N and when there are no omissions, this value becomes
essentially equivalent to Guilford’s ${}_cp$.
Table 1 gives a suggested range of values of K in terms of N 
for which upper and lower 30% groups provide convenient base 
groups and for which base groups of smaller size are suggested. 


Table 1

Recommended Base Groups and Comparable Ranges of K and R*

Base     Range of K       Range of R in percentages of N for specified
Group    in terms of N    numbers of choices
                          n = 5        n = 4        n = 3        n = 2
.30N     .25-.75N         40.0-80.0    43.8-81.2    50.0-83.3    62.5-87.5
.20N     .18-.24N         34.4-39.9    38.5-43.7    45.4-49.9    59.0-62.4
         .76-.82N         80.1-85.6    81.3-86.5    83.4-87.9    87.6-91.0
.15N     .13-.17N         30.4-34.3    34.8-38.4    42.0-45.3    56.5-58.9
         .83-.87N         85.7-89.6    86.6-90.2    88.0-91.3    91.1-93.5
.10N     .08-.12N         26.4-30.3    31.0-34.7    38.6-41.9    54.0-56.4
         .88-.92N         89.7-93.6    90.3-94.0    91.4-94.7    93.6-96.0
.06N     .04-.07N         23.2-26.3    28.0-30.9    36.0-38.5    52.0-53.9
         .93-.96N         93.7-96.8    94.1-97.0    94.8-97.3    96.1-98.0
0        0-.03N           20.0-23.1    25.0-27.9    33.3-35.9    50.0-51.9
         .97-1.00N        96.9-100     97.1-100     97.4-100     98.1-100

(The values 20.0, 25.0, 33.3, and 50.0, underlined in the original, represent
expected percentages correct when all answers are marked according to chance.)

* The suggested base group values of this table have been arrived at on the
basis of both theoretical and practical considerations; except for a very few spe-
cial test construction situations it is expected that they will prove quite satisfac-
tory.

The values of R have been derived from the basic formula
$K = (nR - N)/(n - 1)$ solved for R, namely, $R = (n - 1)K/n + N/n$.














In most instances it is believed that the majority of items will 
be of such difficulty that the upper and lower 30% groups will serve. 
If the suggested breakdown of groups has been followed, the data 
necessary for computing the KG Index will be readily available for 
all items. 

The following data for one item of a five-choice test will illus- 
trate the method for computing the KG Index when the number of 
omissions is negligible: 








    R = 268
    W = 96
    O = 4 (negligible)
    N = 368
    n = 5

$$K = R - \frac{W}{n - 1} = 268 - \frac{96}{5 - 1} = 268 - 24 = 244.$$

In terms of N, K = 244/368 = .663N.


Entering the second column of Table 1, “Range of K in terms of 
N ,” we note that K for this item falls within the range .25 — .75N 
corresponding to a base group of .30 N (see the first column). 

Thus the desired base group equals .30 X 368 or 110.4. This 
figure is rounded off to 110. From the total count for the upper 30% 
of the test papers and from the total count for the bottom 30% of 
the test papers, the number of correct responses in these groups is 
found. The actual counts were: 


    $R_U$ = 107        base group, B = 110
    $R_L$ = 47

$$\text{KG Index} = \frac{n(R_U - R_L)}{(n - 1)B}.$$

Substituting, we have

$$\text{KG Index} = \frac{5(107 - 47)}{4 \times 110} = .68.$$
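
The whole computation for this item, from Formula (3) through the Table 1 lookup to Formula (2), might be sketched as follows. The lookup list transcribes the "Range of K" column of Table 1; storing it as a Python list and rounding K/N to two places before the lookup are assumptions made for the illustration, as are all function names.

```python
# Ranges of K/N from Table 1, keyed by base group as a fraction of N.
BASE_GROUP_RANGES = [
    (0.30, [(0.25, 0.75)]),
    (0.20, [(0.18, 0.24), (0.76, 0.82)]),
    (0.15, [(0.13, 0.17), (0.83, 0.87)]),
    (0.10, [(0.08, 0.12), (0.88, 0.92)]),
    (0.06, [(0.04, 0.07), (0.93, 0.96)]),
    (0.00, [(0.00, 0.03), (0.97, 1.00)]),
]

def knowers_formula_3(right, wrong, n_choices):
    """Formula (3): K = R - W/(n - 1)."""
    return right - wrong / (n_choices - 1)

def base_fraction(k_over_n):
    """Base group as a fraction of N for a given K/N (rounded to .01)."""
    k = round(k_over_n, 2)
    for fraction, ranges in BASE_GROUP_RANGES:
        if any(low <= k <= high for low, high in ranges):
            return fraction
    raise ValueError("K/N must lie between 0 and 1")

def kg_index(n_choices, right_upper, right_lower, base):
    """Formula (2): n(R_U - R_L) / ((n - 1) B)."""
    return n_choices * (right_upper - right_lower) / ((n_choices - 1) * base)

# Five-choice item from the text
R, W, N, n = 268, 96, 368, 5
K = knowers_formula_3(R, W, n)            # 244.0
B = round(base_fraction(K / N) * N)       # .30 x 368 = 110.4, rounded to 110
index = kg_index(n, 107, 47, B)           # about .68
```

A base fraction of zero, returned for items that nearly everyone or nearly no one is estimated to know, means that no base group is recommended; `kg_index` should not be called in that case.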


For this item the tetrachoric r based on a 66% vs. 34% split was
.67; the tetrachoric r based on a 50% vs. 50% split was .60 and Guil-
ford’s $\phi$* based on contrasted top and bottom 25% groups was .66.

* Guilford, J. P. The phi coefficient and chi square as indices of item validity.
Psychometrika, 1941, 6, 11-19.

The agreement of these values is not always so close, as is illus-
trated by the data for a two-choice item where chance successes tend
to attenuate the $\phi$ coefficient (and similarly other indices of relation-
ship not making a correction for chance successes). For a specific
two-choice item, N = 401, R = 239, W = 158, O = 4, and since K =
.202N, B = .20N. $R_U$ and $R_L$ were 62 and 36, respectively; thus
$n(R_U - R_L)/(n - 1)B = .65$. The corresponding $\phi$ was .26; no val-
ues of the tetrachoric r were computed.

Formulas (1) and (2) are not applicable where the number of 
omissions is appreciable, for they assume no omissions. 

Formula (3), K= R — W/(n — 1), can be used regardless of 
the number of omissions. It is useful to determine K when it is nec- 
essary to item-analyze speeded tests in which the number of omis- 
sions among later items is appreciable. When K is expressed in terms 
of N, N should include only those who read the item, not the “non-
reads” as defined below.

Omissions on speeded tests are usually of two types: (a) those
by persons who read the item but fail to mark any choice, and (b)
those by persons who have not read as far in the test as the end of
that item. Frederick B. Davis in a communication to the writer sug-
gests a method of determining the “non-reads” directly. The test
papers are scored and arranged in rank order according to the cri-
terion score. The top 6%, next 4%, next 5%, next 5%, and remain-
ing 10%, to total the top 30% of the test papers, are segregated as are
similar bottom groups. Next the papers of each upper group are
gone through separately to determine which item on each paper is
the last one marked. A tally is then made opposite the number of
the next item, for that one is presumably the first item not read by
the subject. This tally is the basis for a cumulative frequency table
of the “non-reads” for each particular item among the top 6%, top
10% (6% + 4%), top 15%, top 20%, and top 30% of the test papers.
A similar count is made of the “non-reads” among the lower 6%,
10%, 15%, 20%, and 30% groups.

By graphic item count on the International Test Scoring machine 
or by other means, the number of right responses and the number of 
wrong responses in each top group to each item is found. The omits 
may be obtained, if desired, by subtracting from the number of pa-
pers within the appropriate upper group the rights plus wrongs plus
“non-reads.” Similarly the omits among the several lower groups
may be obtained, if desired. 

Where the “non-reads” represent an appreciable proportion of
the appropriate base groups, the following modification of the basic
formula for the KG Index is suggested:

$$\text{KG Index} = \frac{n}{n - 1} \left[ \frac{R_U}{B - \text{non-reads}_U} - \frac{R_L}{B - \text{non-reads}_L} \right], \qquad (4)$$

in which n, $R_U$, $R_L$, and B have the same meanings as in Formula (2),

    “non-reads”$_U$ = “non-reads” in upper base group, and
    “non-reads”$_L$ = “non-reads” in lower base group.

The postulated standard of perfect positive and of perfect nega-
tive relationship against which each item is evaluated by the KG In-
dex is based upon probability theory. In the strictest sense, it is
valid only when (a) the number of cases is large and (b) when all
of the alternative responses are equally enticing to those examinees
in ignorance of the subject matter of the question. Although these
conditions are not too frequently met in practice, the KG Index is
believed to possess some possible usefulness as an item validity index
giving closely comparable values for items (1) of different levels of
difficulty but equal discriminative power and/or (2) of different num-
bers of alternative responses.
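
Formula (4) can be sketched in the same way as Formula (2). The function and argument names are hypothetical, and the second set of counts below is invented purely to show the effect of removing the “non-reads” from each base group.

```python
def kg_index_speeded(n_choices, right_upper, right_lower, base,
                     non_reads_upper=0, non_reads_lower=0):
    """Formula (4): KG Index with "non-reads" removed from each base group."""
    upper = right_upper / (base - non_reads_upper)
    lower = right_lower / (base - non_reads_lower)
    return n_choices / (n_choices - 1) * (upper - lower)

# With no "non-reads" the result agrees with Formula (2).
unspeeded = kg_index_speeded(5, 107, 47, 110)            # about .68

# Invented counts: 10 "non-reads" in the upper group, 30 in the lower.
speeded = kg_index_speeded(5, 90, 40, 110, 10, 30)       # about .50
```

Because each group's right responses are divided by the number who actually read the item, the two proportions remain comparable even when the lower group reaches the item far less often than the upper group.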









BOOK REVIEWS


HARALD CRAMÉR. Mathematical Methods of Statistics. Princeton: Princeton
University Press, 1946. Pp. xvi + 575.


It is the purpose of the author to present a logical development of the method 
of mathematical statistics, which presupposes on the part of the reader a mathe- 
matical knowledge of only calculus, algebra, and analytic geometry. The first sec- 
tion of the book, containing 137 pages, is devoted to the presentation of the various 
topics in higher mathematics which are necessary for the proof of theorems in the 
later sections of the book. This section, beginning with point-set theory, contains 
a logical development of Lebesgue measure, the Lebesgue integral, the theory of
additive set functions, and the Lebesgue-Stieltjes integral. The procedure throughout
is to make a detailed, rigorous development for the simplest case and to indicate
briefly the possible generalizations of the theorems to less restricted conditions
and to multidimensional space. At the end of the first part of the book, one
chapter each is devoted to characteristic functions and matrix theory. A final 
chapter of this part covers such topics as Stirling’s formula and Beta and Gamma 
functions. 

This first section of the book may be understandable to European students 
who have finished a course in calculus, but in terms of American education in 
mathematics, it requires more background than that. Especially it requires a 
familiarity with the rigorous development of the calculus and with the manipula- 
tion of the functions of complex variables. The first section is, nevertheless, ex- 
tremely valuable because the selection of pertinent material from function theory 
provides a background for statistics which would be very difficult for anyone who 
is not a professional mathematician to obtain. 

The second part of the book discusses random variables and probability func- 
tions, both univariate and multivariate. The two introductory chapters deal with 
the fundamental basis of probability theory. Cramér develops the concept of 
probability from the empirical knowledge of frequency ratios but does not actually 
base his mathematics upon frequency ratios. Instead probability is a mathematical 
model, a function of point sets, consistent within itself, but designed to have a 
reasonable correspondence with the properties of frequency ratios. The discussion 
of random variables includes the important binomial, Poisson, normal, chi-square, 
t, z, and incomplete Beta distributions as well as explanations of Gram-Charlier 
series and the Pearson system of distributions. General properties of multivariate 
distributions are also discussed. 

The third section, on statistical inference, includes two main types of material:
the general theory of statistical tests and inferences, and the details of particular
sampling distributions. The section on the theory of estimation and the chapter
on the theory of testing hypotheses are particularly valuable. The discussion
of the topic, however, seems less complete than the other sections of the book. 
Cramér probably felt that the other material was mathematically more funda- 
mental. 

For psychologists who are trying to develop real mathematical sophistication in
statistics, the book will be very useful, almost indispensable. After
careful study of Cramér, the reader should be in a position to understand more 
advanced texts and the periodical literature. For the person who wishes to become 
a sophisticated user of statistics, but not necessarily a mathematical statistician, 
Cramér is also valuable but less so. His discussion is neither complete enough nor 
non-technical enough to give such a reader an understanding of the full impli- 
cations of such concepts as “uniformly most powerful test,” or “unbiassed test.” 
The book is not, nor was it intended to be, a handbook of statistical tests. In that 
sense also it is not of maximum value for the practical statistician. There is, how- 
ever, such a scarcity of integrated discussions of the modern theory of statistical 
tests and inferences that even for the less mathematical reader the book is valu- 
able. 
Fels Research Institute for the Study of Human Development 
ALFRED L. BALDWIN 


DAVIS, FREDERICK B. Item-Analysis Data: Their Computation, Interpreta- 
tion, and Use in Test Construction. Harvard Education Papers Number 2. 
Cambridge: Graduate School of Education, Harvard University, 1946. Pp. 
v + 42.


This monograph does not undertake to provide a complete review and discussion 
of techniques of item-analysis. It is limited to (1) the exposition of one procedure 
which the author has developed for analyzing and expressing the difficulty and 
discriminating power of items and (2) critical remarks on certain specific prob- 
lems in the interpretation and use of item-analysis data. In addition, a four-page 
bibliography on item-analysis is provided. 

The distinctive feature of the procedures presented by the author is that they 
yield indices scaled in presumably equal units, with a range from 0 to 100. In the 
case of difficulty, defined as per cent of the group knowing the answer to an item, 
this involves assuming that ability is normally distributed in the group studied and 
then converting percentages into abscissa values of the normal probability curve. 
These are then multiplied by an appropriate constant to give the desired range of 
scores. In the case of discrimination indices, correlation coefficients are translated
into values of Fisher’s z, and these are multiplied by the constant which yields 
a value of 100 corresponding to 99 per cent success in the upper 27 per cent of 
cases and 1 per cent success in the lower 27 per cent. 
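The two conversions described in this paragraph can be sketched in Python. The review does not give Davis's constants, so the scaling here is an assumption made only for illustration: the difficulty scale is anchored so that 1 per cent and 99 per cent success fall at 0 and 100, and the discrimination scale takes as an input the correlation that is to receive the value 100. The function names are hypothetical.

```python
import math
from statistics import NormalDist

_UNIT_NORMAL = NormalDist()  # mean 0, standard deviation 1


def normal_deviate(percent_knowing: float) -> float:
    """Abscissa of the unit normal curve below which the given per cent lies."""
    return _UNIT_NORMAL.inv_cdf(percent_knowing / 100.0)


def scaled_difficulty(percent_knowing: float,
                      low_anchor: float = 1.0,
                      high_anchor: float = 99.0) -> float:
    """Linear rescaling of the normal deviate to a 0-100 range.

    The anchors (1 and 99 per cent) are assumed, not taken from Davis.
    """
    z = normal_deviate(percent_knowing)
    z_low = normal_deviate(low_anchor)
    z_high = normal_deviate(high_anchor)
    return 100.0 * (z - z_low) / (z_high - z_low)


def fisher_z(r: float) -> float:
    """Fisher's z transformation of a correlation coefficient."""
    return math.atanh(r)


def scaled_discrimination(r: float, r_at_100: float) -> float:
    """Fisher's z multiplied by the constant that maps r_at_100 to 100.

    r_at_100 stands for the correlation corresponding to 99 per cent
    success in the upper group and 1 per cent in the lower group; its
    numerical value is not stated in the review and must be supplied.
    """
    return 100.0 * fisher_z(r) / fisher_z(r_at_100)
```

Equal steps on either scale then correspond to equal steps in the normal deviate or in Fisher's z, rather than to equal steps in the percentage or the correlation itself.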

Both difficulty and discrimination indices are extracted from the per cent of 
successes, failures, and omissions in the top and bottom 27 per cent of the group 
on the criterion measure (usually total score). A chart has been prepared which 
provides the difficulty and discrimination indices for any pair of percentages of 
success (after correction for chance) in the two groups. Thus, the procedure and
the table are an elaboration of those developed earlier by Flanagan.

The procedure of obtaining item indices from per cent of success in the upper 
and lower 27 per cent has been found to be an efficient and practical procedure, 
especially where an IBM test-scoring machine with a graphic item counter attach- 
ment is available. The use of scaled values for difficulty and discrimination indices 
gets away from the non-linearity of the number scale both of proportions and of 
correlation coefficients. However, the numerical values of the proposed new indices
will be entirely unfamiliar to the user and will present some difficulty of interpre- 
tation on that account. Their value would appear to be chiefly for individuals who 
are going to do a great deal of item analysis and in cases where the item indices
are to be used for comparative purposes within a test-development organization. 
The indices would probably prove a source of confusion in published reports. 

Chapter IV of the monograph presents a stimulating discussion of various 
problems connected with the use and interpretation of item-analysis data. In 
general, the tenor of these remarks is to emphasize that item-analysis is a valuable 
supplementary aid to but not a substitute for good item-writing and editing, and 
that item-analysis data should be used with insight and discretion rather than 
mechanically. Though the discussions are brief and suggestive rather than defini- 
tive in many cases, they should stimulate the reader to critical thought on a num- 
ber of phases of the item-analysis problem. 


Teachers College, Columbia University
ROBERT L. THORNDIKE

















SPECIAL NOTICE 


The U. S. Civil Service Commission has announced an exami- 
nation for filling Research Psychologist positions in Washington, 
D. C., and throughout the United States. 

The salaries for Research Psychologist positions range from 
$4,902 to $9,975 a year. The duties of the positions are of a highly 
responsible and technical nature. To qualify, applicants must have had
4 years of progressive professional experience in conducting or participating
in important research projects in the field of psychology.
This experience must have been at successively higher levels of responsibility
for the higher grades, and must show the applicant’s ability to
plan, direct and coordinate research programs of considerable scope
and complexity. Applicants for the highest salary level must have
earned recognition as leaders in the field of psychology. Graduate 
study in psychology may be substituted year for year for 3 years of 
the required experience. No written test is required for this exami- 
nation. The age limit of 62 years is waived for persons entitled to 
veteran preference. 

Applications for the Research Psychologist examination will be 
accepted until further notice. However, some positions will be filled 
immediately. Persons interested in these positions should apply at 
once. Information and application forms may be obtained at most 
first- and second-class post offices, from Civil Service regional offices, 
and from the U. S. Civil Service Commission, Washington, D. C. 

